STA623 - Bayesian Data Analysis

Author

Marc Henrion

Published

September 2025

Preliminaries

Notes

  • These notes were written in quarto.

  • All examples / code in these notes are in R, using NIMBLE / BUGS for Bayesian model specification.

  • GitHub repository - will contain all course materials by the end of the week:

    https://github.com/gitMarcH/UNIMA_STA623_2025

Some general references for Bayesian statistics / data analysis:

  1. Hoff, P.D. (2009). “A First Course in Bayesian Statistical Methods.” Springer.

  2. Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2014). “Bayesian Data Analysis”. 3rd ed. CRC Press.

  3. Ramoni, M., Sebastiani, P. (2007), ‘Bayesian Methods’, in Berthold, M., Hand, D.J. (eds.). “Intelligent Data Analysis”, 2nd ed., Springer, pp.131-168

  4. Stone, J.V. (2013). “Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis”. Sebtel Press.

Notation

  • \(X, Y, Z\) - random variables

  • \(x, y, z\) - measured / observed values

  • \(\bar{X}\), \(\bar{Y}, \bar{Z}\) - sample mean estimators for X, Y, Z

  • \(\bar{x}\), \(\bar{y}, \bar{z}\) - sample mean estimates of X, Y, Z

  • \(\hat{T}\), \(\hat{t}\) - given a statistic T, estimator and estimate of T

  • \(P(A)\) - probability of an event A occurring

  • \(f_X(.)\), \(f_Y(.), f_Z(.)\) - probability mass / density functions of X, Y, Z; sometimes \(p_X(.)\) etc. rather than \(f_X(.)\)

  • p(.) - used as a shorthand notation for pmfs / pdfs if the use of this is unambiguous (i.e. it is clear which is the random variable)

  • \(X\sim F\) - X distributed according to distribution function F

  • \(E[X]\), \(E[Y]\), \(E[Z]\), \(E[T]\) - the expectation of X, Y, Z, T respectively

Recap of probability theory

This section is largely based on and in places quoted verbatim from

Feelders, Ad J. (2007), ‘Statistical Concepts’, in Berthold, M., Hand, D.J. (eds.) Intelligent Data Analysis, 2nd ed., Springer, pp.17-68

Introduction

Probabilities can be used informally to express information and our beliefs about unknown quantities.

This can be made formal.

\[\,\]

Probabilities can be used to express rational beliefs and there is a relationship between probability and information.

Bayes’ rule provides a rational way of updating beliefs in the light of new information.

This process of inductive learning is referred to as Bayesian inference.

\[\,\]

Bayesian methods provide

  • Statistical estimators with desirable properties.

  • Parsimonious descriptions of data.

  • A computational framework for model estimation, selection and validation.

There are 3 main paradigms for statistical inference:

  • Frequentist paradigm

    The fundamental theorem underlying this is the Law of Large Numbers. Inference results from the asymptotic case where an experiment would be repeated infinitely many times.

  • Bayesian paradigm

    The fundamental theorem underlying this is Bayes’ Theorem. Probabilities encode subjective beliefs that get updated (objectively) by observed data.

  • Fisherian / fiducial paradigm

    Based on rules which allow one to draw conclusions from samples of data. This inference paradigm is very rarely used in the present day and we will not focus on it.

Frequentist paradigm

  • Parameters are fixed but unknown and we focus on quantifying uncertainty regarding these fixed parameters.

  • Probabilities are always interpreted as long run relative frequency.

  • Procedures are judged by how well they perform in the long run over an infinite number of hypothetical repetitions of the experiment.

Bayesian paradigm

  • Parameters are considered to be random variables.

  • Probability statements about parameters must be interpreted as “degrees of belief”.

  • We revise our beliefs about parameters after observing the data by using Bayes’ Theorem.

  • Yields posterior parameter distribution - conditional on the particular dataset that was observed.

Random experiments

A random experiment is an experiment that satisfies the following conditions:

  1.  All possible outcomes are known in advance.
  2.  In any particular trial, the outcome is not known in advance.
  3.  The experiment can be repeated under identical conditions.

The outcome space or universe \(\Omega\) of an experiment is the set of all possible outcomes of the experiment.

\[\,\]

Examples

  • In the coin tossing experiment earlier \(\Omega=\{H,T\}\).
  • When you roll a die \(\Omega=\{1,2,3,4,5,6\}\).

An event is a subset of the outcome space.

\[\,\]

Examples

  • “Coin lands head”: \(A = \{x\in\Omega | \,x\mbox{ is heads}\} = \{H\}\)
  • “Die shows even number”: \(B = \{x\in\Omega | \,x\mbox{ is even}\} = \{2,4,6\}\)

\[\,\]

Special events

  • Impossible / empty event: \(A = \emptyset\)
  • Sure event / outcome space: \(A = \Omega\)
  • Singleton events: \(A = \{H\}\), \(A = \{3\}\)
  • The complementary event: \(\bar{A}=\Omega\setminus A\)

Probability

Definition

Classical definition of probability

Let \(|.|\) denote the operator measuring the size of an event. The probability of an event \(A\subseteq\Omega\) is defined as \[P(A)=\frac{|A|}{|\Omega|}\]

If all outcomes in \(\Omega\) are equally likely, then this means the probability of \(A\) is the ratio of the number of outcomes in \(A\) and the number of outcomes in \(\Omega\).

If your outcome space is not discrete, then \(|.|\) is a function mapping outcome sets to the positive real line.

Frequency definition of probability

Suppose an experiment is repeated \(k\) times, producing an outcome \(o_i\) on the \(i\)th run. Probability is then defined as the long-run relative frequency:

\[P(A)=\mbox{lim}_{k\rightarrow\infty}\frac{\sum_i I(o_i\in A)}{k}\] where \(I(.)\) is the indicator function (1 if its argument is true, 0 otherwise).
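To make this concrete, here is a small R sketch (added for illustration; not part of the original notes) that approximates a probability by its long-run relative frequency:

```r
# Approximate P(heads) for a fair coin by the long-run relative frequency.
set.seed(42)                                   # for reproducibility
k <- 100000                                    # number of repetitions of the experiment
tosses <- sample(c("H", "T"), k, replace = TRUE)
relFreq <- cumsum(tosses == "H") / seq_len(k)  # running relative frequency
relFreq[k]                                     # close to the true value P(H) = 0.5
```

As \(k\) grows, the running relative frequency settles ever closer to the true probability.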

Subjective definition of probability

According to this definition, probability is a measure of the degree of belief that an event \(A\) will occur.

Degree of belief depends on the person who has the belief, so with this definition the probability \(P(A)\) can be different for different people.

The subjective definition of probability allows expressing all uncertainty through probability - this is important for Bayesian statistics.

History

Probability as a mathematical concept was formally introduced in the 17th century by French mathematicians Blaise Pascal and Pierre de Fermat when they were discussing games of chance.

The formal, mathematical derivation of probability theory follows from set theory and measure theory.

Blaise Pascal (public domain / Wikipedia)

Pierre de Fermat (public domain / Wikipedia)

Probability axioms

Probability (whether according to the classical, frequency or subjective definition) is a function \(P(.)\) from subsets \(A\) of \(\Omega\) to the real line satisfying the following axioms:

  1.  \(P(A)\geq0\), \(\forall A \subseteq\Omega\)
  2.  if \(A\cap B=\emptyset\), then \(P(A\cup B) = P(A) + P(B)\), \(\forall A,B\subseteq\Omega\)
  3.  \(P(\Omega)=1\)

Everything else in probability theory is derived from these 3 axioms.

Conditional probability

The probability of an event \(B\) can be influenced by information about the occurrence of an event \(A\). The conditional probability of \(B\) given \(A\), denoted \(P(B|A)\), is defined as the probability of event \(B\) given that \(A\) has occurred. For \(P(A)>0\):

\[P(B|A) = \frac{P(A\cap B)}{P(A)}\]

Intuitively: \(A\) is the new, reduced universe / outcome space \(\Omega_r\). The division by \(P(A)\) guarantees that the conditional distribution sums / integrates to 1, i.e. is a valid probability distribution.

From the conditional probability, we can derive the multiplication rule:

\[P(A\cap B)=P(B|A)\cdot P(A)\]
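A quick simulation sketch (our own illustration, assuming a fair die) checks the definition: conditional on an even roll (event \(A\)), the probability of rolling a 2 (event \(B\)) should be \((1/6)/(1/2)=1/3\).

```r
# Estimate P(B|A) = P(A and B) / P(A) for a fair die:
# A = "even number", B = "die shows 2".
set.seed(1)
rolls <- sample(1:6, 100000, replace = TRUE)
A <- rolls %% 2 == 0
B <- rolls == 2
mean(A & B) / mean(A)   # empirical P(A and B) / P(A), approx 1/3
mean(rolls[A] == 2)     # conditioning directly on the reduced universe, same value
```

The second line of output illustrates the "reduced universe" intuition: conditioning on \(A\) is the same as restricting attention to the rolls where \(A\) occurred.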

Independence

Events \(A\) and \(B\) are said to be independent if the occurrence of one event does not influence the probability of occurrence of the other event.

\[P(A|B)=P(A)\] \[P(B|A)=P(B)\]

This can more concisely be expressed as:

\[P(A\cap B)=P(A)P(B)\]

Law / Theorem of Total Probability

We define events \(B_1,B_2,\ldots,B_n\subseteq\Omega\) to form a partition of \(\Omega\) if

  • \(B_i\cap B_j=\emptyset\), \(\forall i\neq j\)
  • \(\bigcup_{i=1}^n B_i = \Omega\)

From the probability axioms it follows that, for any event \(A\subseteq\Omega\):

\[P(A) = \sum_{i=1}^n {P(A|B_i)\,P(B_i)}=\sum_{i=1}^n {P(A\cap B_i)}\]

This is known as the Theorem of Total Probability.

Example

A box contains 4 balls: 3 white, 1 red.

First draw one ball at random. Then, without replacing the first ball, draw a second ball from the box.

What is the probability that the second ball is a red ball?

\[\,\]

This is most easily calculated using the Theorem of Total Probability (TTP).

Let \(R_1, R_2\) be the event of drawing a red ball on the first / second draw, and similarly for \(W_1, W_2\).

Note that \(\bar{R}_1=W_1\), and hence \(R_1, W_1\) form a partition of \(\Omega\).

\[P(R_2)=P(R_2|W_1)P(W_1)+P(R_2|R_1)P(R_1)=\frac{1}{3}\cdot\frac{3}{4}+0\cdot\frac{1}{4}=\frac{1}{4}\]
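The result can be checked by simulation in R (a sketch added here, not part of the original example):

```r
# Simulate two draws without replacement from a box {W, W, W, R}
# and estimate P(second ball is red); the TTP gives exactly 1/4.
set.seed(2)
nSim <- 50000
box <- c("W", "W", "W", "R")
secondRed <- replicate(nSim, sample(box, 2)[2] == "R")  # sample() draws without replacement
mean(secondRed)   # approx 0.25
```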

Bayes’ Rule

Bayes’ Theorem shows how probabilities change in light of evidence:

\[P(B|A)=\frac{P(A|B)P(B)}{P(A)}\]

And for a partition \(B_1,\ldots,B_n\) of \(\Omega\):

\[P(B_i|A)=\frac{P(A|B_i)P(B_i)}{\sum_j{P(A|B_j)P(B_j)}}\]

Bayes’ Rule really just rewrites the conditional probability using the multiplication rule (numerator) and the Theorem of Total Probability (denominator).

Bayes’ Rule was first formulated by an 18th century English clergyman, Thomas Bayes; it was only published after his death.

While Bayes’ Rule is important for Bayesian statistics, it is a result from probability theory and is useful in both Bayesian and frequentist statistics.

(probably not) Thomas Bayes (public domain / Wikipedia)

Example: diagnostic test

Disease \(D\), with \(P(D)=0.001\), i.e. occurs only in \(0.1\%\) of the population.

There is a diagnostic test, which can give a positive (\(T\)) or negative (\(\bar{T}\)) result. The diagnostic test has \(95\%\) sensitivity (i.e. \(P(T|D)=0.95\)) and \(98\%\) specificity (i.e. \(P(\bar{T}|\bar{D})=0.98\)).

What is the probability that a patient has the disease if the test result is positive?

Note that \(D, \bar{D}\) is a partition of the outcome space.

Apply Bayes’ Rule:

\[ \begin{align} P(D|T) &= \frac{P(T|D)P(D)}{P(T|D)P(D)+P(T|\bar{D})P(\bar{D})} \\ &= \frac{0.95\cdot 0.001}{0.95\cdot 0.001 + (1-0.98)\cdot(1-0.001)} \\ &= 0.0454 \end{align} \]
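The same computation in R (a sketch added for illustration; variable names are ours):

```r
# Diagnostic test example: prevalence, sensitivity, specificity.
pD   <- 0.001   # P(D)
sens <- 0.95    # P(T|D), sensitivity
spec <- 0.98    # P(not T | not D), specificity
pT <- sens * pD + (1 - spec) * (1 - pD)   # Theorem of Total Probability
pDgivenT <- sens * pD / pT                # Bayes' Rule
round(pDgivenT, 4)                        # 0.0454
```

Despite the positive test, the posterior probability of disease is under \(5\%\): the low prevalence (the prior) dominates.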

Note that:

\[P(D|T)\propto P(T|D)\cdot P(D)\] where:

  • \(P(D|T)\) is the posterior probability

  • \(P(T|D)\) is the likelihood

  • \(P(D)\) is the prior probability

We can consider \(P(T)\) (the denominator) to be just a constant that scales \(P(D|T)\) so that it is a valid distribution.

Random variables

A random variable \(X\) is a function from the outcome space \(\Omega\) to the real line:

\[X:\Omega\rightarrow \mathbb{R}\]

Example

Consider the experiment of tossing a coin 2 times:

\[\Omega=\{(H,H),(H,T),(T,H),(T,T)\}\] The number of heads turning up is a random variable \(X\):

\[X((H,H))=2\] \[X((H,T))=1\] \[X((T,H))=1\] \[X((T,T))=0\]

Probability distributions

A probability mass function (pmf) \(p\) assigns to each realisation \(x\) of a discrete random variable X the probability \(P(X=x)=p(x)\).

It follows from the axioms of probability that \(p(x)\geq0\) and \(\sum_{x}{p(x)} = 1\).

What about continuous random variables?

For a continuous random variable \(X\), \(P(X=x)=0\) for all values of x (the probability of exactly realising one value among an infinity of possible values is 0). Hence it makes little sense to define a pmf.

Instead, we will define probabilities as areas under a curve. A probability density function (pdf) is a function \(p:\mathbb{R}\rightarrow\mathbb{R}^+\) so that

\[P(a<X\leq b)=\int_a^bp(x)dx\]

It follows from the axioms of probability that \(p(x)\geq0\) and \(\int_{-\infty}^{\infty}{p(x)dx} = 1\).

Note that while the axioms of probability imply that in the discrete case, a pmf satisfies \(p(x)\leq1\), in the continuous case, a pdf \(p(x)\) does not have to be bounded above by 1.

Example

If we have the pdf given by

\[ p(x)=\begin{cases} 2 & \mbox{ for } 0\leq x\leq 0.5 \\ 0 & \mbox{ otherwise} \end{cases} \]

Then it follows that

\[P(0.1<X\leq0.3)=\int_{0.1}^{0.3}2dx=[2x]_{0.1}^{0.3}=0.6-0.2=0.4\]
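This integral can be verified numerically in R (an added sketch):

```r
# The pdf p(x) = 2 on [0, 0.5] and 0 elsewhere; P(0.1 < X <= 0.3) should be 0.4.
pdf <- function(x) ifelse(x >= 0 & x <= 0.5, 2, 0)
integrate(pdf, lower = 0.1, upper = 0.3)$value   # 0.4
```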

Expectation & variance

What is the expected or average / mean value for a given distribution? Let us define the expectation or the mean of a random variable.

Discrete random variables: \[E(X)=\sum_{x}{x\,p(x)}\] Continuous random variables: \[E(X)=\int_{-\infty}^{\infty}{x\,p(x)\,dx}\] Notation: \[\mu=E(X)\]

We can also compute expectations for arbitrary functions \(h:\mathbb{R}\rightarrow\mathbb{R}\) of a random variable:

\[ E(h(X))=\begin{cases} \sum_x{h(x)\,p(x)} &\mbox{ if }x\mbox{ is discrete} \\ \int_{-\infty}^{\infty}{h(x)\,p(x)\,dx} &\mbox{ if }x\mbox{ is continuous} \end{cases} \]

One special case of such a function \(h\) is \(h(x)=(x-\mu)^2\) and is used to define the variance of a random variable.

The variance \(Var(X)=\sigma^2\) of a random variable \(X\) measures the spread around the mean and is obtained by averaging the squared differences \((x-\mu)^2\).

\[ \begin{align} \sigma^2 &=& E[(X-\mu)^2] \\ &=& E[(X-E(X))^2] \end{align} \]

The standard deviation \(\sigma\) has the advantage of being on the same scale as \(X\).

Conditional distributions

Discrete case:

\[p(x|C)=P(X=x|C)=\frac{P(\{X=x\}\cap C)}{P(C)}\]

Continuous case:

\[ p(x|C) = \begin{cases} p(x)/P(C) &\mbox{ for }x\in C \\ 0 &\mbox{ otherwise} \end{cases} \]

Joint distribution

A pair of random variables \((X,Y)\) will have a joint distribution and this is uniquely determined by their joint probability function \(p:\mathbb{R}^2\rightarrow\mathbb{R}_+\).

Discrete case (in this case: \(p:\mathbb{R}^2\rightarrow[0,1]\)): \[p(x,y) = P((X,Y)=(x,y)) = P(X=x, Y=y)\]

From the axioms of probability: \(p(x,y)\geq0\) and \(\sum_x\sum_y p(x,y) = 1\).

Continuous case: \[P(a<X\leq b,c<Y\leq d) = \int_a^b\int_c^d p(x,y)dxdy\]

From the axioms of probability: \(p(x,y)\geq0\) and \(\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(x,y)\,dxdy = 1\).

Marginal distribution

The marginal distribution function of X can be obtained from the joint distribution function by summing (discrete case) or integrating (continuous case) over Y.

\[\,\] Discrete case: \[p_X(x)=P(X=x)=\sum_y P((X,Y)=(x,y))=\sum_y p(x,y)\] Continuous case: \[p_X(x)=\int_{-\infty}^{\infty}p(x,y)\,dy\]

Conditional distribution

We can define the conditional distribution function of X given Y.

\[p(x|y)=\frac{p(x,y)}{p_Y(y)}\]

As before for events, we define random variables \(X,Y\) to be independent if

\[p(x,y)=p_X(x)p_Y(y)\mbox{ for all }(x,y)\]

Probability theory for random variables

All the previous definitions and theorems also apply to probability mass and density functions.

For discrete random variables, this is obvious as the probability mass function simply specifies probabilities.

For continuous random variables, these follow from the definitions of joint, conditional and marginal distribution functions.

Bayes’ Rule

Let \(X,Y\) be 2 random variables. Then

\[f_{X|Y=y}(x|y)=\frac{f_{Y|X=x}(y|x)\,f_X(x)}{f_Y(y)}\]

Exchangeability & independence

Given a dataset \(\{x_1,\ldots,x_n\}\), let \(p(x_1,\ldots,x_n)\) be the joint probability density or mass function of \(X_1,\ldots,X_n\). \[\,\]

If \(p(x_1,\ldots,x_n)=p(x_{\pi_1},\ldots,x_{\pi_n})\) for all permutations \(\pi\) of \(1,\dots,n\), then \(X_1,\ldots,X_n\) are exchangeable. \[\,\]

The subscript contains no information about the outcomes.

An important result is

\[ X_1,\ldots,X_n\mbox{ are exchangeable for all }n\iff\begin{cases} & X_1,\ldots,X_n|\theta\mbox{ are i.i.d.} \\ & \theta\sim p(\theta) \end{cases} \]

Unless specified otherwise we will always assume exchangeability.

Bayesian inference

Smallpox and chickenpox example

Let’s start with an example (taken from Stone, J.V. (2013). “Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis.”, Sebtel Press.).

You wake up one morning with spots all over your face. You are worried and go to the doctor. The doctor tells you that \(90\%\) of people with smallpox present with spots on their face.

You are (naturally) very worried now as smallpox is a very serious disease (also: it has been eradicated since the 1980s).

However, more useful to know would be the probability of having smallpox.

Suppose doctors collect data on people presenting with smallpox and chickenpox and the symptoms they present. Based on this, the doctor will calculate

\[p(\mbox{spots }|\mbox{ smallpox}) = 0.9\]

and

\[p(\mbox{spots }|\mbox{ chickenpox}) = 0.8\]

These two expressions are called the likelihoods of the data (spotty face) given smallpox / chickenpox as the cause; they are obtained from the sampling model that we assume for the data.

The maximum likelihood estimate for the disease based on these two is smallpox.

However, smallpox is very rare (in fact extinct these days) and chickenpox more common.

The doctor also has the recorded prevalences for these two diseases; this is the prior knowledge:

\[p(\mbox{smallpox})=0.001\] and

\[p(\mbox{chickenpox})=0.1\]

Further, the overall proportion of people with spots on their faces in the population, called the marginal likelihood or the evidence, is given by

\[p(\mbox{spots})=0.081\]

Using Bayes’ Rule:

\[p(\mbox{smallpox }|\mbox{ spots}) = \frac{p(\mbox{spots }|\mbox{ smallpox})\,p(\mbox{smallpox}) }{p(\mbox{spots})} = \frac{0.9\cdot0.001}{0.081} = 0.0111\]

and

\[p(\mbox{chickenpox }|\mbox{ spots}) = \frac{p(\mbox{spots }|\mbox{ chickenpox})\,p(\mbox{chickenpox}) }{p(\mbox{spots})} = \frac{0.8\cdot0.1}{0.081} = 0.9877\]
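Both posterior probabilities can be reproduced in R (an added sketch using the numbers from the example):

```r
# Posterior probabilities for the two candidate diseases given spots.
lik      <- c(smallpox = 0.9,   chickenpox = 0.8)   # p(spots | disease)
prior    <- c(smallpox = 0.001, chickenpox = 0.1)   # p(disease)
evidence <- 0.081                                   # p(spots)
posterior <- lik * prior / evidence
round(posterior, 4)   # smallpox 0.0111, chickenpox 0.9877
```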

While we cannot be certain, it is very likely that you have chickenpox, not smallpox and so you can relax.

Reproduced from Stone, J.V. (2013). “Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis”. Sebtel Press.

Updating one’s belief

What we have done here, is we have updated a prior belief after observing some data.

Specifically:

\[\mbox{hypothesis = disease is smallpox / chickenpox}\]

\[\mbox{data = symptoms of spots on face}\]

Bayesian inference:

\[p(\mbox{hypothesis | data}) = \frac{p(\mbox{data|hypothesis})\,p(\mbox{hypothesis})}{p(\mbox{data})}\]

or:

\[\mbox{posterior}=\frac{\mbox{likelihood}\times\mbox{prior}}{\mbox{evidence}}\]

Often, the hypothesis can be framed as a statement about a parameter of interest \(\theta\). Writing \(y\) for the data, we can write quite generally:

\[p(\theta|y)=\frac{p(y|\theta)\,p(\theta)}{p(y)}\]

This leads to the probability density version of Bayes’ Rule, which underlies all of Bayesian statistics:

\[f_{\Theta|\mathbf{Y}=\mathbf{y}}(\theta|\mathbf{y})=\frac{f_{\mathbf{Y}|\Theta=\theta}(\mathbf{y}|\theta)\,f_\Theta(\theta)}{f_\mathbf{Y}(\mathbf{y})}=\frac{f_{\mathbf{Y}|\Theta=\theta}(\mathbf{y}|\theta)\,f_\Theta(\theta)}{\int_{\Omega_\theta} f_{\mathbf{Y}|\Theta=\theta}(\mathbf{y}|\theta)\,f_\Theta(\theta)\,d\theta}\]

It is important to note that the denominator (the evidence) is fixed for a given dataset, i.e. it is constant. It normalises the posterior distribution so that it sums (pmf), resp. integrates (pdf), to 1 – this is a requirement for the posterior to be a valid probability distribution.

This means that

\[p(\theta|y)\propto p(y|\theta)\,p(\theta)\]

or put differently

\[\mbox{posterior}\propto\mbox{likelihood}\times\mbox{prior}\]

In Bayesian statistics, both data variables and distribution parameters are considered to be random.

Beta prior, binomial sampling model

Suppose we are interested in estimating the probability \(\pi\) of heads for a particular coin.

As we have no reason to believe that the coin is biased, we may assume a prior distribution for \(\pi\) which has maximum density at \(\pi=0.5\).

One such distribution is the Beta(a=4,b=4) distribution.

Recall that \(\Pi\sim \mbox{Beta}(a,b)\) for parameters \(a>0,b>0\) if

\[p(\pi)=\begin{cases} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\pi^{a-1}(1-\pi)^{b-1} \quad & \mbox{if }\pi\in[0,1]\\ 0 & \mbox{otherwise} \end{cases}\]

where \(\Gamma(\alpha)=\int_0^{\infty}z^{\alpha-1}e^{-z}dz\) is the gamma function.

For simplicity, we will write \(\gamma(a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\).

Further, if \(\Pi\sim\mbox{Beta}(a,b)\):

\[E(\Pi)=\frac{a}{a+b},\quad Var(\Pi)=\frac{ab}{(a+b)^2(a+b+1)}\]

\[\mbox{mode}(\Pi)=\begin{cases} \frac{a-1}{a+b-2}\, &\mbox{ if }a>1,b>1 \\ [0,1] \mbox{ (any value)} &\mbox{if }a=b=1 \\ \{0,1\} \mbox{ (bimodal)} &\mbox{if }a<1,b<1 \\ 0 &\mbox{if }a\leq1,b>1 \\ 1 &\mbox{if }a>1, b\leq1 \end{cases}\]


With \(a=4\) and \(b=4\), our prior distribution becomes

\[p(\pi)=140\cdot\pi^3\cdot(1-\pi)^3\]

if \(\pi\in[0,1]\) and \(p(\pi)=0\) otherwise.
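As a quick check in R (added sketch), the normalising constant \(\gamma(4,4)=\Gamma(8)/(\Gamma(4)\Gamma(4))\) is indeed 140, and the density above matches R's built-in `dbeta()`:

```r
# Check the Beta(4,4) normalising constant and density.
a <- 4; b <- 4
constBeta <- gamma(a + b) / (gamma(a) * gamma(b))
constBeta                               # 140
piVal <- 0.25                           # an arbitrary point in [0,1]
c(dbeta(piVal, a, b),                   # R's Beta density
  constBeta * piVal^(a - 1) * (1 - piVal)^(b - 1))   # the formula above
```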

Suppose the coin is flipped \(n\) times and we observe \(k\) heads.

The likelihood of observing these data, given \(\pi\), is obtained from the binomial distribution.

Recall \(Y\sim\mbox{Bin}(n,\pi)\) for parameters \(n\in\{0,1,2,\ldots\}\) and \(\pi\in[0,1]\) if

\[p(Y=k|\pi)=\begin{cases} \binom{n}{k}\pi^k(1-\pi)^{n-k} \; & \mbox{if } k\in\{0,1,\ldots,n\}\\ 0 & \mbox{otherwise} \end{cases}\]

And

\[ E[Y|\pi]=n\pi,\;Var(Y|\pi)=n\pi(1-\pi),\;\mbox{mode}(Y|\pi)=\lfloor(n+1)\pi\rfloor \]

Suppose \(n=6\), \(k=2\), then the likelihood becomes

\[p(k=2|\pi)=\binom{6}{2}\pi^2(1-\pi)^4\]

Given the prior distribution

\[p(\pi)=140\cdot\pi^3\cdot(1-\pi)^3\]

and the likelihood

\[p(k=2|\pi)=\binom{6}{2}\cdot\pi^2\cdot(1-\pi)^4\]

Exercise: find the posterior distribution \(p(\pi|k=2)\).

Quite generally, the posterior is given by

\[p(\pi|k) = \frac{p(k|\pi)\,p(\pi)}{\int_0^1 p(k|\theta)\,p(\theta) d\theta}\]

The numerator is given by:

\[ \begin{align} p(k|\pi)\,p(\pi) &=& \binom{n}{k}\pi^k(1-\pi)^{n-k} \gamma(a,b)\pi^{a-1}(1-\pi)^{b-1} \\ &=& \gamma(a,b)\binom{n}{k}\pi^{a+k-1}(1-\pi)^{b+n-k-1} \end{align} \]

And the denominator is given by

\[\int_0^1 p(k|\theta)\,p(\theta) d\theta = \gamma(a,b)\binom{n}{k}\int_0^1 \theta^{a+k-1}(1-\theta)^{b+n-k-1} d\theta\]

To solve this integral, make use of the fact that the Beta(\(\alpha\),\(\beta\)) distribution needs to integrate to 1 for it to be a valid probability distribution:

\[ \begin{align} & \int_0^1 \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1} d\theta &=& 1 \\ \Rightarrow & \int_0^1\theta^{\alpha-1}(1-\theta)^{\beta-1} d\theta &=& \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \end{align} \]

Writing \(\alpha=a+k\) and \(\beta=b+n-k\), we see that

\[\int_0^1 \theta^{a+k-1}(1-\theta)^{b+n-k-1} d\theta = \frac{\Gamma(a+k)\Gamma(b+n-k)}{\Gamma(a+b+n)}\]

Therefore, the posterior density for \(\pi\) is given by

\[ \begin{align} p(\pi|k) &=& \frac{\gamma(a,b)\binom{n}{k}\pi^{a+k-1}(1-\pi)^{b+n-k-1}}{\gamma(a,b)\binom{n}{k}\frac{\Gamma(a+k)\Gamma(b+n-k)}{\Gamma(a+b+n)}} \\ &&\\ &=& \frac{\Gamma(a+b+n)}{\Gamma(a+k)\Gamma(n-k+b)}\pi^{a+k-1}(1-\pi)^{b+n-k-1} \end{align} \]

which is a Beta(\(a+k\),\(b+n-k\)) distribution.

Filling in \(k=2\), \(n=6\), \(a=4\), \(b=4\), we get a Beta(6,8) distribution:

\[p(\pi|k=2)=10296\cdot\pi^5\cdot(1-\pi)^7\]

for \(\pi\in[0,1]\).
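We can verify this numerically in R (an added sketch): normalising likelihood \(\times\) prior on a fine grid recovers the Beta(6,8) density.

```r
# Numerically verify that likelihood x prior, once normalised, equals the
# Beta(a+k, b+n-k) density, here Beta(6,8).
a <- 4; b <- 4; n <- 6; k <- 2
h <- 0.001
piGrid <- seq(h, 1 - h, by = h)
unnorm <- dbinom(k, n, piGrid) * dbeta(piGrid, a, b)  # likelihood x prior
post <- unnorm / (sum(unnorm) * h)                    # normalise numerically
max(abs(post - dbeta(piGrid, a + k, b + n - k)))      # essentially 0
```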

Note: Remember that \(\Gamma(m)=(m-1)!\) for \(m=1,2,3,\ldots\) .

Summary

So we see that

\[\,\]

\[\begin{cases} \mbox{prior }\, p(\pi) & = \mbox{Beta}(a,b) \\ \\ \mbox{likelihood }\, p(k|\pi) & = \mbox{Bin}(n,\pi) \end{cases}\]

\[\,\]

\[\Rightarrow\mbox{posterior }\, p(\pi|k) = \mbox{Beta}(a+k,b+n-k)\]

Discussion

Unlike frequentist statistics, where we would have obtained a point estimate \(\hat{\pi}\), Bayesian statistics yield a posterior distribution for \(\pi\).

We can still come up with a point estimate by, e.g., finding the value of \(\pi\) for which the posterior distribution achieves its maximum (assuming this exists and is unique). This is called maximum a posteriori (MAP) estimation.

There are other point estimators we could use: e.g. the expectation of the posterior distribution \(E\left[p(\pi|k)\right]\).

We will get back to this later.

For our example the MAP is achieved at \(\pi=\hat{\pi}_{MAP}\) such that \(\frac{d}{d\pi}p(\pi|k)=0\).

\[\frac{d}{d\pi}p(\pi|k)=0 \iff \pi=\hat{\pi}_{MAP}=\frac{a+k-1}{a+b+n-2}\]

For \(n=6\), \(k=2\), \(a=4\), \(b=4\) this yields \(\hat{\pi}_{MAP}=5/12=0.41\bar{6}\).

The likelihood is maximised at the (frequentist) maximum likelihood estimate \(\hat{\pi}_{MLE}=\frac{k}{n}\), which for \(n=6\), \(k=2\) is \(\hat{\pi}_{MLE}=2/6=0.\bar{3}\).

The prior distribution for this example is maximised at \(\hat{\pi}=0.5\).

The data “pushes” the prior distribution towards the likelihood function.
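This can be checked numerically in R (an added sketch): maximising the posterior density recovers the MAP formula, which sits between the prior mode (0.5) and the MLE (1/3).

```r
# MAP vs MLE for n = 6, k = 2 with a Beta(4,4) prior.
a <- 4; b <- 4; n <- 6; k <- 2
post <- function(p) dbeta(p, a + k, b + n - k)        # Beta(6,8) posterior density
piMAP <- optimize(post, c(0, 1), maximum = TRUE)$maximum
c(MAP = piMAP,
  formula = (a + k - 1) / (a + b + n - 2),            # (a+k-1)/(a+b+n-2) = 5/12
  MLE = k / n)                                        # 1/3
```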

Now, see what happens if we increase the amount of data, i.e. the number of repeated experiments, assuming the proportion of heads remains the same and keeping the same Beta(4,4) prior.

\[\,\]

  • For \(n=6\), \(k=2\), the posterior is a Beta(6,8) distribution.

  • For \(n=12\), \(k=4\), the posterior is a Beta(8,12) distribution.

  • For \(n=60\), \(k=20\), the posterior is a Beta(24,44) distribution.

The posterior becomes more and more indistinguishable from the likelihood: the more data there is, the less important the prior becomes – the likelihood dominates.
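The posterior mean makes this concrete (an added R sketch): with a Beta(4,4) prior it equals \((a+k)/(a+b+n)\), which approaches the MLE \(k/n=1/3\) as \(n\) grows.

```r
# Posterior mean for a Beta(4,4) prior as n grows, keeping k/n = 1/3.
a <- 4; b <- 4
ns <- c(6, 12, 60)
ks <- ns / 3
postMean <- (a + ks) / (a + b + ns)     # E[pi | k] = (a+k)/(a+b+n)
rbind(n = ns,
      postMean = round(postMean, 4),    # 0.4286, 0.4, 0.3529: moves towards 1/3
      mle = round(ks / ns, 4))
```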

Posteriors for 4 different prior & sample size combinations

Discrete prior, binomial sampling model

That the likelihood dominates the prior for large datasets is not fully true: where the prior distribution (whether it is a pdf or a pmf) is zero, the posterior distribution is also zero – no matter how much data. This follows from

\[p(\theta|y)\propto p(y|\theta)\,p(\theta)\]

Among other things, this means that a discrete prior leads to a discrete posterior distribution.

Let us return to the same coin tossing experiment. Suppose that rather than a Beta(a,b) prior, we know that the coin used for the experiment can only be one of 6 coins: 3 are biased towards heads with \(\pi=P(\mbox{heads})=0.8\), 2 are biased towards tails with \(\pi=0.3\) and 1 coin is fair with \(\pi=0.5\). Assuming each coin is equally likely to be picked, this gives the following prior distribution for \(\pi\):

\[p(\pi)=\begin{cases} 1/3=0.\bar{3} & \mbox{if }\pi=0.3 \\ 1/6=0.1\bar{6} & \mbox{if }\pi=0.5 \\ 1/2=0.5 & \mbox{if }\pi=0.8 \end{cases}\]

The likelihood is still given by the binomial distribution:

\[p(k|\pi)=\binom{n}{k}\pi^k(1-\pi)^{n-k}\]

Here we have \(n=6\) and \(k=2\).

This can be written down / computed using tables.
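The table computation can be done directly in R (an added sketch):

```r
# Posterior table for the discrete prior on pi, with n = 6 and k = 2.
piVals <- c(0.3, 0.5, 0.8)
prior  <- c(1/3, 1/6, 1/2)
lik    <- dbinom(2, 6, piVals)                 # binomial likelihood at each pi
posterior <- lik * prior / sum(lik * prior)    # Bayes' Rule, normalised
round(cbind(pi = piVals, prior = prior,
            likelihood = lik, posterior = posterior), 3)
```

Note how the data shift most of the posterior mass onto \(\pi=0.3\), even though the prior favoured \(\pi=0.8\).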

Note on the binomial distribution

Recall that the binomial distribution results from \(n\) independent trials of binary experiments with probability parameter \(\pi\).

The coin toss example was in fact based on repeated binary / Bernoulli experiments.

We used the binomial distribution, because the sum of successes is a sufficient statistic for \(\pi\), i.e. it contains all the information we need to make inference about \(\pi\).

This can be seen by writing down the likelihood for \(Y_1,\ldots,Y_n\) independent, identically Bernoulli(\(\pi\)) distributed random variables:

\[p(y_1,\ldots,y_n|\pi)=\prod_i p(y_i|\pi)=\prod_i \pi^{y_i}(1-\pi)^{1-y_i}=\pi^{\sum_i y_i}(1-\pi)^{n-\sum_i y_i}=\pi^k(1-\pi)^{n-k}\]

When we consider \(Y=\sum_iY_i\), and want to make statements about \(P(Y=k)\) we need to also consider how many different ways there are to obtain the same sum – this is where the binomial coefficient \(\binom{n}{k}\) comes from.

Posterior predictive distribution

Let us return to the example where \(\pi\sim \mbox{Beta}(a,b)\) and \(k|\pi\sim \mbox{Bin}(n,\pi)\).

The posterior distribution \(p(\pi|k)\) allows us to make inference about \(\pi\) after observing the data \(Y=\sum_i Y_i = k\).

However we can also make inference about future data \(\tilde{Y}_{n+1}\).

For this we need to derive the posterior predictive distribution \(p(\tilde{Y}_{n+1}|k)=p(\tilde{Y}_{n+1}|y_1,\ldots,y_n)\).

\[ \begin{align} p(\tilde{Y}_{n+1}=1|y_1,\ldots,y_n) &=& \int_0^1 p(\tilde{Y}_{n+1}=1,\pi|y_1,\ldots,y_n)d\pi\\ &=& \int_0^1 p(\tilde{Y}_{n+1}=1|\pi,y_1,\ldots,y_n)\,p(\pi|y_1,\ldots,y_n)d\pi \\ &=& \int_0^1 \pi \,p(\pi|y_1,\ldots,y_n)d\pi \\ &=& E[\pi|y_1,\ldots,y_n] = E[\pi|k] \\ &=& \frac{a+k}{a+b+n} \end{align} \]

where the last line follows from the fact that the posterior distribution for \(\pi|k\) is a Beta(a+k,b+n-k) distribution.

It follows that

\[ \begin{align} p(\tilde{Y}_{n+1}=0|y_1,\ldots,y_n) &=& 1-p(\tilde{Y}_{n+1}=1|y_1,\ldots,y_n) \\ && \\ &=& \frac{b+n-k}{a+b+n} \end{align} \]
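A Monte Carlo sketch in R (added here) confirms the posterior predictive probability: draw \(\pi\) from the posterior, then a new Bernoulli observation given \(\pi\).

```r
# Monte Carlo check of the posterior predictive for the beta-binomial example.
set.seed(7)
a <- 4; b <- 4; n <- 6; k <- 2
piDraws <- rbeta(100000, a + k, b + n - k)        # draws from the Beta(6,8) posterior
yNew <- rbinom(100000, size = 1, prob = piDraws)  # posterior predictive draws
c(simulated = mean(yNew),
  exact = (a + k) / (a + b + n))                  # both approx 6/14
```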

Note that

  1. The posterior predictive distribution does not depend on any unknown quantities (otherwise we would not be able to use it to make predictions).

  2. The posterior predictive distribution depends on observed data. This may seem to violate the exchangeability condition. But this is for future data. The past data \(y_1,\ldots,y_n\) provide information about \(\pi\) and this in turn provides information about \(\tilde{Y}_{n+1}\). If this was not the case, we would not be able to infer anything about the unsampled population given the sampled cases.

Prior distribution

The likelihood factor in Bayesian models should look familiar: you have already encountered likelihoods in other modules (e.g. STA6103 - GLM).

You may struggle with the prior distribution: how do you decide what is a good prior distribution?

A good solution is to ask experts in the field you are working in. You can even combine priors from several experts through a mixture distribution, which also allows you to specify different weights for different experts.

However, experts are not always available…

Informative & vague priors

In the first coin toss experiment we used a Beta(4,4) distribution. We argued this was appropriate since it had highest density at \(\pi=0.5\) which reflected our prior belief that there is little reason to assume the coin is not well balanced.

Clearly this was informative: the prior expressed specific, definite information about \(\pi\) and favoured \(\pi=0.5\).

But a Beta(2,2) and a Beta(8,8) would also have had highest density at \(\pi=0.5\).

Depending on how sure we are, we can go for a more peaked distribution.

If we have no prior knowledge at all, we could argue that all values of \(\pi\) are equally likely. This would justify a Uniform(0,1) distribution.

A prior distribution which expresses only general or vague information about a parameter is called a diffuse or vague prior, and sometimes a non-informative prior.

Note that assuming all possible values of a parameter to be equally likely is not, strictly speaking, equivalent to being completely ignorant. Every prior is informative to some degree, so it is generally recommended not to use the expression non-informative prior, though it is common.

Note: Beta(1,1) = Uniform(0,1).

How do you come up with a vague prior?

Easy in the case of the discrete or continuous uniform distribution, but this does not work for discrete distributions with infinitely many possible values or for continuous distributions over an open-ended interval.

Solution: choose a prior with a sensibly chosen mean / median and a large variance, e.g. \(\mathcal{N}(0,10)\).

Such priors are called weakly informative and are very useful for regularisation, i.e. to keep inferences in a reasonable range.

Jeffreys prior

Sir Harold Jeffreys devised a general rule for generating an objective or vague prior for a sampling model \(p(y|\theta)\). The Jeffreys prior is given by

\[p_J(\theta)\propto \sqrt{I(\theta)}\]

where \(I(\theta)=-E\left[\left.\frac{\delta^2}{\delta\theta^2}\log\, p(Y|\theta)\right|\theta\right]\) is the Fisher information.

This can lead to prior distributions which are not actually probability distributions. These are called improper priors (see practical).

An important property of Jeffreys priors is that they are invariant under transformation.

Example

Let \(Y\sim Bin(n,\theta)\). Derive \(p_J(\theta)\).

We have:

\[ \begin{align} \frac{\partial}{\partial\theta}\log\, p(Y=y|\theta) &=& \frac{\partial}{\partial\theta}\log\left(\binom{n}{y}\theta^y(1-\theta)^{n-y}\right) \\ &=& \frac{\partial}{\partial\theta}\left[\log\binom{n}{y}+y\log\theta+(n-y)\log(1-\theta)\right] \\ &=& \frac{y}{\theta}-\frac{n-y}{1-\theta} \end{align} \]

and so

\[\frac{\partial^2}{\partial\theta^2}\log\,p(y|\theta)=-\frac{y}{\theta^2}-\frac{n-y}{(1-\theta)^2}\]

Hence:

\[I(\theta)=-E\left[-\frac{Y}{\theta^2}-\frac{n-Y}{(1-\theta)^2}\right]=\frac{E[Y|\theta]}{\theta^2}+\frac{n-E[Y|\theta]}{(1-\theta)^2}\]

Note that \(Y\sim Bin(n,\theta)\Rightarrow E[Y|\theta]=n\theta\):

\[I(\theta)=\frac{n\theta}{\theta^2}+\frac{n-n\theta}{(1-\theta)^2}=\frac{n}{\theta}+\frac{n}{1-\theta}=\frac{n}{\theta(1-\theta)}\]

Therefore:

\[p_J(\theta)\propto\sqrt{\frac{n}{\theta(1-\theta)}}\]

In other words:

\[p_J(\theta)\propto\theta^{-1/2}(1-\theta)^{-1/2}\]

If \(\theta\sim \mbox{Beta}\left(\frac{1}{2},\frac{1}{2}\right)\), then \(p(\theta)=\frac{\Gamma(1)}{\Gamma\left(\frac{1}{2}\right)\Gamma\left(\frac{1}{2}\right)}\theta^{-1/2}(1-\theta)^{-1/2}\).

Therefore, if \(Y\sim Bin(n,\theta)\), then the Jeffreys prior distribution for \(\theta\) is a Beta\(\left(\frac{1}{2},\frac{1}{2}\right)\) distribution. This prior is proper.
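As a quick check (a minimal base R sketch), we can verify numerically that the Beta\(\left(\frac{1}{2},\frac{1}{2}\right)\) Jeffreys prior is proper, i.e. integrates to 1, even though its density is unbounded at 0 and 1:

```r
# The Jeffreys prior for a binomial proportion is Beta(1/2, 1/2);
# its density is U-shaped and unbounded at 0 and 1, but it is proper
integrate(function(theta) dbeta(theta, 0.5, 0.5), lower = 0, upper = 1)$value
# approx. 1

dbeta(c(0.1, 0.5, 0.9), 0.5, 0.5)  # more prior mass near 0 and 1 than near 0.5
```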


Conjugate prior

We can also choose a family of prior distributions for mathematical simplicity.

We saw earlier that

\[\mbox{beta prior} \quad + \quad \mbox{binomial sampling model} \quad\Rightarrow\quad \mbox{beta posterior}\]

This is called conjugacy: the beta distribution is the conjugate prior for a binomial sampling model.

Formally we can define:

A class \(\mathcal{P}\) of prior probability distributions is called conjugate for a sampling model \(p(y|\theta)\) if

\[p(\theta)\in\mathcal{P}\quad\Rightarrow\quad p(\theta|y)\in\mathcal{P}\]

Example: Gamma prior, Poisson sampling model

Recall that if \(Y\sim \mbox{Pois}(\lambda)\), then

\[\begin{cases} P(Y=k|\lambda) &= \frac{\lambda^ke^{-\lambda}}{k!} & \;\mbox{if }k=0,1,2\ldots\mbox{ and 0 otherwise} \\ E[Y|\lambda] &=\lambda & \\ Var(Y|\lambda) &=\lambda & \end{cases}\]

Suppose \(Y_i\sim_{\mbox{iid}} \mbox{Pois}(\lambda)\), \(i=1,\ldots,n\) and we observe data \(y_1,\ldots,y_n\).

The joint pdf of the data is given by

\[ \begin{align} p(y_1,\ldots,y_n|\lambda) &=&\prod_{i=1}^n p(y_i|\lambda) \\ &=&\prod_{i=1}^n \frac{1}{y_i!}\lambda^{y_i}e^{-\lambda} \\ &=&\frac{1}{\prod_i y_i!}\lambda^{\sum_i y_i} e^{-n\lambda} \end{align} \]

The factor \(\frac{1}{\prod_i y_i!}\) is constant for each dataset. So if we compare densities for different values of \(\lambda\), this factor will cancel out. All information about \(\lambda\) is therefore contained in \(\sum_i y_i\) (as was the case for the i.i.d. binary data).

\(\sum_i y_i\) is a sufficient statistic for the parameter \(\lambda\) in the Poisson sampling model.

Posterior density

If we assume a prior distribution \(p(\lambda)\), then the posterior density for \(\lambda\) given the data is given by

\[ \begin{align} p(\lambda|y_1,\ldots,y_n) &\propto& p(\lambda) \, p(y_1,\ldots,y_n|\lambda) \\ &\propto& p(\lambda) \, \lambda^{\sum y_i}e^{-n\lambda} \end{align} \]

If we want \(p(\lambda)\) to be conjugate for the Poisson sampling model, i.e. we want \(p(\lambda|y_1,\ldots,y_n)\) to be in the same family of distributions as \(p(\lambda)\), then the above implies that \(p(\lambda)\) needs to include terms of the form \(\lambda^{c_1}e^{-c_2\lambda}\).

The simplest class of such distributions is the family of gamma distributions (these include only such terms).

Recall \(\Lambda\sim\Gamma(a,b)\) with parameters \(a>0, b>0\) if

\[ p(\lambda|a,b)=\begin{cases} \frac{b^a}{\Gamma(a)}\lambda^{a-1}e^{-b \lambda} &\mbox{ if }\lambda>0\\ 0 &\mbox{ otherwise} \end{cases} \]

where \(\Gamma(\alpha)=\int_0^{\infty}z^{\alpha-1}e^{-z}dz\) is the gamma function.

And

\[E(\Lambda)=\frac{a}{b},\, Var(\Lambda)=\frac{a}{b^2},\,\mbox{mode}(\Lambda)=\frac{a-1}{b}\,\mbox{ if }a>1,\, 0\mbox{ otherwise}\]


Suppose \[\begin{cases} Y_1,\ldots,Y_n\sim_{\mbox{iid}}\mbox{Pois}(\lambda) \; &\mbox{(sampling model)}\\ & \\ \lambda\sim\Gamma(a,b) &\mbox{(prior distribution)} \end{cases}\]

Then

\[ \begin{align} p(\lambda|y_1,\ldots,y_n) &\propto & p(\lambda)\,p(y_1,\ldots,y_n|\lambda) \\ &\propto & \left(\lambda^{a-1}e^{-b\lambda}\right) \left(\lambda^{\sum y_i} e^{-n\lambda}\right) \\ &\propto & \lambda^{a+\sum y_i-1} e^{-(b+n)\lambda} \end{align} \]

This is a \(\Gamma(a+\sum_i y_i,b+n)\) distribution.

So we see that

\[\begin{cases} \mbox{prior }\, p(\lambda) & = \Gamma(a,b) \\ \\ \mbox{likelihood }\, p(y_1,\ldots,y_n|\lambda) & = \mbox{Pois}(\lambda) \end{cases}\]

\[\Rightarrow\mbox{posterior }\, p(\lambda|y_1,\ldots,y_n) = \Gamma(a+\sum_i y_i,b+n)\]
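We can verify this conjugate update numerically (a sketch with illustrative values \(a=2\), \(b=1\) and simulated Poisson data): the grid-normalised product of prior and likelihood should coincide with the \(\Gamma(a+\sum_i y_i,b+n)\) density.

```r
# Numerical check of the gamma-Poisson conjugacy result
set.seed(1)
a <- 2; b <- 1                 # prior Gamma(a, b) (illustrative values)
y <- rpois(20, lambda = 3)     # simulated Poisson data
n <- length(y)

lam <- seq(0.01, 10, by = 0.01)
unnorm <- dgamma(lam, a, b) * sapply(lam, function(l) prod(dpois(y, l)))
gridPost <- unnorm / sum(unnorm * 0.01)       # grid-normalised prior x likelihood
exactPost <- dgamma(lam, a + sum(y), b + n)   # conjugate-update result

max(abs(gridPost - exactPost))                # essentially zero
```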

Conjugate priors for exponential family distributions

The binomial and Poisson sampling models we have seen so far are examples of one-parameter exponential family distributions.

A one-parameter exponential family distribution is any distribution whose density (or mass) function can be expressed under the form

\[p(y|\theta)=h(y)c(\theta)e^{\theta t(y)}\]

where \(\theta\) is the unknown parameter and \(t(y)\) is the sufficient statistic for the sampling model.

A general conjugate prior for a one-parameter exponential family sampling model is given by

\[p(\theta|n_0,t_0)=\kappa(n_0,t_0)c(\theta)^{n_0}e^{n_0t_0\theta}\]

Given observed data for i.i.d. variables \(Y_1,\ldots,Y_n\), the posterior is then

\[ \begin{align} p(\theta|y_1,\ldots,y_n) &\propto& p(\theta)p(y_1,\ldots,y_n|\theta) \\ &\propto& c(\theta)^{n_0+n}\exp{\left(\theta\cdot\left[n_0t_0+\sum_{i=1}^n t(y_i)\right]\right)} \\ &\propto& p\left(\theta\,\left|\,n_0+n,\frac{n_0t_0+\sum_it(y_i)}{n_0+n}\right.\right) \end{align} \]

Note: \(n_0\approx\) “prior sample size”, \(t_0\approx\) “prior guess of \(t(Y)\)”.

As an aside, recall from the GLM module:

The two-parameter (location \(\theta\) and scale \(\phi\)) exponential family distributions

\[f(y|\theta,\phi)=\exp\left(\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\right)\]

Convince yourself this is the same thing!

\[\,\]

Solution:

  • Set \(a(\phi)=1\) (effectively get rid of this parameter)

  • Set \(c(y,\phi)=\log\, h(y)\)

  • Set \(-b(\theta)=\log\, c(\theta)\)

  • Set \(t(y)=y\)

How to choose a prior distribution

  • The shape of the prior shows the “belief weights” we give for all possible values before we look at the data.

  • The prior must not be derived from the data. The posterior is proportional to \(prior \,\times\, likelihood\); this multiplication implicitly assumes that the prior is specified independently of the observed data.

  • If you don’t want to favor any one value over another, use a vague or diffuse prior, e.g. a normal distribution with a large variance or a uniform prior over the entire possible range.

  • The choice of prior is usually not crucial: all priors that place reasonable probability over the realistic range of the parameter will lead to quite similar posteriors. We have seen that with enough data, the likelihood makes the greater contribution to the posterior. Priors can, however, restrict the range of the posterior distribution.

  • Conjugate priors simplify calculations greatly as the posterior will be of the same family.

  • So identify the family of conjugate priors for your sampling model, then pick the member whose density matches your beliefs most closely.

The normal model

So far we have studied one-parameter distributions.

What about two-parameter distributions?

For example the normal distribution \(\mathcal{N}(\mu,\sigma^2)\) with pdf

\[\phi(x|\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)\]

To be able to do joint inference for \((\mu,\sigma^2)\), we can break the problem into 2 one-parameter problems.

For example, for the prior:

\[ p(\mu,\sigma^2)=p(\mu|\sigma^2)\cdot p(\sigma^2) \]

And, after observing data \(\{y_i\}_{i=1}^n\) with \(Y_i\sim_{\mbox{iid}}\mathcal{N}(\mu,\sigma^2)\), similarly for the posterior distribution:

\[ p(\mu,\sigma^2|y_1,\ldots,y_n)= p(\mu|\sigma^2,y_1,\ldots,y_n)\,p(\sigma^2|y_1,\ldots,y_n) \]

The above results follow straight from the definition of conditional probability: \(P(A\cap B)=P(A|B)\cdot P(B)\).

Posterior for the mean conditional on the variance

One can show (Practicals 1 & 2, Exercise 8) that assuming \(\sigma^2\) to be fixed, i.e. by conditioning on \(\sigma^2\), then a conjugate prior distribution for the mean \(\mu\) is the normal distribution itself:

\[ \begin{cases} M|\sigma^2 &\sim& \mathcal{N}(\mu_0,\sigma_0^2)\qquad &\mbox{(prior)} \\ Y_1,\ldots,Y_n|\mu,\sigma^2 &\sim& \mathcal{N}(\mu,\sigma^2)\qquad &\mbox{(sampling model)} \end{cases} \]

Then the posterior is also a normal distribution:

\[ M|\sigma^2,y_1,\ldots,y_n\sim\mathcal{N}(\mu_n,\sigma_n^2) \]

where

\[ \begin{cases} \mu_n &=& \frac{\frac{1}{\sigma_0^2}}{\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}}\mu_0 + \frac{\frac{n}{\sigma^2}}{\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}}\bar{y} \\ &&\\ \sigma_n^2 &=&\frac{1}{\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}} \end{cases} \]

If we write

\[ \begin{cases} \tau &=& 1/\sigma^2 &\qquad \mbox{(sampling precision)} \\ \tau_0 &=& 1/\sigma_0^2 = \kappa_0/\sigma^2 &\qquad \mbox{(prior precision)} \\ \tau_n &=& 1/\sigma_n^2 &\qquad \mbox{(posterior precision)} \end{cases} \]

(Think of \(\kappa_0\) as the prior sample size.)

Then for the posterior mean and precision we can write (defining \(\kappa_n=\kappa_0+n\) - the combined prior and observed sample size):

\[ \begin{cases} \mu_n &=& \frac{\tau_0}{\tau_0+n\tau}\mu_0 + \frac{n\tau}{\tau_0+n\tau}\bar{y} &=&\frac{\kappa_0}{\kappa_n}\mu_0+\frac{n}{\kappa_n}\bar{y} \\ &&\\ \tau_n &=& \tau_0+n\tau &=& \kappa_n\tau \end{cases} \]

In other words, for inference for \(\mu\) conditional on \(\sigma^2\):

  • The posterior mean is a weighted average of the prior mean and the sample mean (with the weights determined by the prior and sample precisions and the number of observed data points).

  • The posterior precision is a sum of the prior precision and the sample precision, with the latter weighted by the number of observations.
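A small numerical sketch of these updating formulas, with illustrative values for the prior, the (conditioned-on) sampling variance and the data:

```r
# Posterior for mu conditional on sigma^2, using the precision formulas above
mu0 <- 0; sigma0sq <- 4   # prior N(mu0, sigma0sq) for mu (illustrative)
sigmasq <- 1              # (conditioned-on) sampling variance
ybar <- 1.8; n <- 25      # observed sample mean and sample size

tau0 <- 1 / sigma0sq; tau <- 1 / sigmasq
mun <- (tau0 * mu0 + n * tau * ybar) / (tau0 + n * tau)
taun <- tau0 + n * tau

mun       # approx. 1.78: with 25 observations the data dominate the weak prior
1 / taun  # posterior variance, smaller than both prior and sampling variance
```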

Posterior for the variance

We now need a prior for the variance \(\sigma^2\). This needs to have support on \((0,\infty)\) as it is for a variance parameter. The gamma distribution would be a candidate, but it turns out not to be conjugate for the normal sampling model. However, the gamma is conjugate for the precision \(\tau=1/\sigma^2\); or in other words, the inverse-gamma distribution is a conjugate prior for the variance \(\sigma^2\) of a normal sampling model:

\[ \begin{cases} \mbox{precision}: & \tau=1/\sigma^2 &\sim& \Gamma(a,b) \\ \mbox{variance}: & \sigma^2 &\sim&\mbox{inv-}\Gamma(a,b) \end{cases} \]

For interpretability, we will not use \(a,b\) to parameterise the inverse-gamma distribution. Instead we will use \(\nu_0\), interpretable as a prior sample size, and \(\sigma_0^2\), interpretable as a prior guess of the variance.

\[ \tau=1/\sigma^2\sim\Gamma\left(\frac{\nu_0}{2},\frac{\nu_0}{2}\sigma_0^2\right) \]

Now we can derive the posterior:

\[\begin{align} p(\sigma^2|y_1,\ldots,y_n) &\propto& p(\sigma^2)p(y_1,\ldots,y_n|\sigma^2) \\ &=& p(\sigma^2)\int p(y_1,\ldots,y_n|\mu,\sigma^2)p(\mu|\sigma^2)d\mu \end{align}\]

(This integral is left as an exercise.)

The solution is that

\[ 1/\sigma^2|y_1,\ldots,y_n\sim\Gamma\left(\frac{\nu_n}{2},\frac{\nu_n}{2}\sigma_n^2\right) \]

where

\[ \begin{cases} \nu_n &=&\nu_0+n \\ \sigma_n^2 &=&\frac{1}{\nu_n}\left[\nu_0\sigma_0^2+(n-1)s^2+\frac{\kappa_0 n}{\kappa_n}(\bar{y}-\mu_0)^2\right] \end{cases} \]

and \(s^2=\frac{1}{n-1}\sum_{i=1}^n(y_i-\bar{y})^2\) is the usual sample variance.

Summary

We can do inference for the joint distribution of \((\mu,\sigma^2)\) for the two-parameter normal distribution, by casting the problem as two one-parameter problems (once by conditioning on \(\sigma^2\) and once by integrating out \(\mu\)).

With the following prior distributions and sampling model:

\[ \begin{cases} 1/\sigma^2 &\sim\Gamma\left(\frac{\nu_0}{2},\frac{\nu_0}{2}\sigma_0^2\right) \\ \mu|\sigma^2 &\sim \mathcal{N}(\mu_0,\sigma^2/\kappa_0) \\ Y_1,\ldots,Y_n|\mu,\sigma^2 &\sim_{iid}\mathcal{N}(\mu,\sigma^2) \end{cases} \]

We can do inference for \(\mu\) and \(\sigma^2\) using the following posterior distributions:

\[ \begin{cases} \mu|y_1,\ldots,y_n,\sigma^2 &\sim \mathcal{N}(\mu_n,\sigma^2/\kappa_n) \\ 1/\sigma^2|y_1,\ldots,y_n &\sim\Gamma\left(\frac{\nu_n}{2},\frac{\nu_n}{2}\sigma_n^2\right) \end{cases} \]

with

\[ \begin{cases} \mu_n &=&\frac{\kappa_0}{\kappa_n}\mu_0+\frac{n}{\kappa_n}\bar{y} \\ \kappa_n &=& \kappa_0 + n \\ \sigma_n^2 &=&\frac{1}{\nu_n}\left[\nu_0\sigma_0^2+(n-1)s^2+\frac{\kappa_0 n}{\kappa_n}(\bar{y}-\mu_0)^2\right] \\ \nu_n &=&\nu_0+n \end{cases} \]

Bayesian estimation

The main objects of inference in Bayesian statistics are

  • The posterior distribution of the parameter \(\theta\) given the data \(y\): \(p(\theta|y)\).

  • The posterior predictive distribution of new data \(\tilde{y}\) given observed data \(y\): \(p(\tilde{y}|y)\).

Often we will look at the distributions of functions of \(\theta\), \(\tilde{y}\).

Posterior distributions are probability mass or density functions and therefore allow us to make direct statements about the probability for \(\theta\), resp. \(\tilde{y}\), or a function of these, to take particular values or lie in certain regions.

Frequentist statistics by contrast yields point estimates \(\hat{\theta}\) and \(\hat{y}\). Together with point estimates for the uncertainty associated with these estimates (usually under the form of standard errors), these allow inference in their own right.

While the Bayesian posterior distributions have many advantages over point estimates, it is sometimes useful to summarise the posterior distributions by point estimates. We already briefly touched upon these in Section 2.

Bayes estimator

A Bayes estimator \(\hat{\theta}_B\) is an estimator that minimises the posterior expected value of a loss function (i.e. the posterior expected loss).

Mathematically:

\[\hat{\theta}_B=\arg\min_{\hat{\theta}} \int_\mathcal{Y}\int_\mathcal{\Theta}\mathcal{C}(\theta-\hat{\theta})p(\theta,y)d\theta dy\]

where \(\mathcal{C}(.)\) is a cost function.

Note that the integration is over both \(\theta\) and \(y\).

We can use any cost function, but the most commonly used ones include:

  • Quadratic loss: \(\mathcal{C}(x)=x^2\).

  • Absolute loss: \(\mathcal{C}(x)=|x|\).

  • 0-1 loss: \(\mathcal{C}(x)=\begin{cases}0\qquad\mbox{if }|x|<\epsilon \\ 1\qquad\mbox{if }|x|\geq\epsilon\end{cases}\).

We will see that:

  • Quadratic loss \(\Rightarrow\) posterior mean.

  • Absolute loss \(\Rightarrow\) posterior median.

  • 0-1 loss \(\Rightarrow\) posterior mode.

Whatever cost function is used, the joint density (resp. mass) function \(p(\theta,y)\) is usually rewritten using the multiplication rule: \(p(\theta,y)=p(\theta|y)p(y)\).

The optimisation problem then becomes

\[\hat{\theta}_B=\arg\min_{\hat{\theta}}\int\int \mathcal{C}(\theta-\hat{\theta})p(\theta|y)d\theta \, p(y)dy\]

Since the probability axioms guarantee that \(p(y)\geq0\), it is enough to find the estimator that minimises the inner integral:

\[\hat{\theta}_B=\arg\min_{\hat{\theta}}\int \mathcal{C}(\theta-\hat{\theta})p(\theta|y)d\theta\]

Note that there is a slight change in optimisation here: we find the optimum now for one realisation of \(y\), not across all realisations.

Quadratic loss function

See Practical 3 for the derivation of this estimator:

\[\hat{\theta}_B=\int \theta p(\theta|y)d\theta=E[\theta|y]\]

This particular estimator is also called the minimum mean squared error (MMSE) estimator and is the most widely used point estimate for posterior distributions.

Absolute loss function

It can be shown that solving

\[\hat{\theta}_B=\arg\min_{\hat{\theta}}\int|\theta-\hat{\theta}|p(\theta|y)d\theta\]

is equivalent to finding \(\hat{\theta}\) such that

\[\int_{-\infty}^{\hat{\theta}_B}p(\theta|y)d\theta = \int_{\hat{\theta}_B}^\infty p(\theta|y)d\theta\]

In other words, we need to find \(\hat{\theta}_B\) that divides the posterior probability density into 2 regions with equal probability mass:

\[\int_{-\infty}^{\hat{\theta}_B}p(\theta|y)d\theta = \int_{\hat{\theta}_B}^\infty p(\theta|y)d\theta=\frac{1}{2}\]

This is the definition of the posterior median.

0-1 loss function

This is also sometimes called the hit-or-miss loss function.

We need to find \(\hat{\theta}_B\) such that

\[\hat{\theta}_B=\arg\min_{\hat{\theta}}\int \mathcal{C}(\theta-\hat{\theta})p(\theta|y)d\theta\]

where

\[ \mathcal{C}(x)=\begin{cases} 0\qquad\mbox{if }|x|<\epsilon\\ 1\qquad\mbox{if }|x|\geq\epsilon \end{cases} \]

for some \(\epsilon>0\).

The integral becomes

\[ \begin{align} \int \mathcal{C}(\theta-\hat{\theta})p(\theta|y)d\theta &=& \int_{-\infty}^{\hat{\theta}-\epsilon}1\cdot p(\theta|y)d\theta + \int_{\hat{\theta}+\epsilon}^\infty 1\cdot p(\theta|y)d\theta \\ &=& 1-\int_{\hat{\theta}-\epsilon}^{\hat{\theta}+\epsilon}p(\theta|y)d\theta \end{align} \]

This is minimised by maximising \(\int_{\hat{\theta}-\epsilon}^{\hat{\theta}+\epsilon}p(\theta|y)d\theta\).

For small \(\epsilon\) and smooth \(p(\theta|y)\) this is maximised at the maximum, i.e. the mode, of the posterior distribution.

Note that the last of these estimators is usually not considered to be a Bayes estimator, since it requires (fairly mild) conditions to exist (\(p(\theta|y)\) smooth, with a single mode) and is a limiting case (\(\epsilon\rightarrow0\)).

For this reason, maximum a posteriori (MAP) estimation is often considered an alternative to Bayes estimation.
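A short sketch comparing the three point estimates for the \(\Gamma(45,20)\) posterior from Practical 3, Exercise 3 (the kernel density estimate of the mode is one option of convenience, not the only one):

```r
# Point estimates under the three loss functions for a Gamma(45, 20) posterior,
# exact values compared with Monte Carlo samples
set.seed(123)
a <- 45; b <- 20
s <- rgamma(1e5, shape = a, rate = b)

c(exact = a / b, mc = mean(s))               # quadratic loss: posterior mean
c(exact = qgamma(0.5, a, b), mc = median(s)) # absolute loss: posterior median
(a - 1) / b                                  # 0-1 loss (limit): posterior mode
# from samples, the mode can be approximated via a kernel density estimate:
d <- density(s)
d$x[which.max(d$y)]
```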

Credible intervals

It is often desirable to identify regions of the parameter space that have a high probability of containing the parameter. To do this we can construct, after observing some data \(y\), an interval (say) \([l(y),u(y)]\) such that the probability that \(\theta\in[l(y),u(y)]\) is high.

Bayesian coverage

An interval based on observed data \(Y=y\) has \(100\times (1-\alpha)\%\) Bayesian coverage for \(\theta\) if

\[P(l(y)<\theta<u(y)\,|\,Y=y)=1-\alpha\]

Note that here \(\theta\) is random, the interval is fixed.

Frequentist coverage

A random interval \([l(Y),u(Y)]\) has \(100\times (1-\alpha)\%\) frequentist coverage for \(\theta\) if, before the data are collected,

\[P(l(Y)<\theta<u(Y)\,|\theta)=1-\alpha\]

Note that here \(\theta\) is fixed, the interval is random.

Bayesian intervals are usually called credible intervals and frequentist intervals are called confidence intervals, though the term Bayesian confidence interval is also in use.

Bayesian credible intervals often also have good frequentist coverage – see Hoff, P.D. (2009), Sections 3.1.2 and 3.4 for a comment on this.

Anchoring a credible interval

Frequentist confidence intervals are often centered on the point estimate. If not centered on the point estimate, they at the very least contain it and so anchor the interval.

In a Bayesian setting this is a bit less straightforward: we could pick any interval along the support to get an interval with the desired coverage.

We could simply center the interval on a chosen posterior point estimate such as the posterior mean. There are other ways to construct such intervals though.

Quantile-based intervals

These are also called central posterior or equi-tailed intervals.

One simple recipe for constructing a \(100\times(1-\alpha)\%\) credible interval is by using posterior quantiles. We need to find \(\theta_{\alpha/2}\) and \(\theta_{1-\alpha/2}\) such that

  • \(P(\theta<\theta_{\alpha/2}|Y=y)=\alpha/2\)

  • \(P(\theta>\theta_{1-\alpha/2}|Y=y)=\alpha/2\)

\(\theta_{\alpha/2},\theta_{1-\alpha/2}\) are the corresponding quantiles of the posterior distribution \(p(\theta|y)\).

It is easy to see that \[ \begin{align} P(\theta\in[\theta_{\alpha/2},\theta_{1-\alpha/2}]|y) &=& 1-(P(\theta<\theta_{\alpha/2}|y)+P(\theta>\theta_{1-\alpha/2}|y)) \\ &=& 1-\alpha \end{align} \]

Highest posterior density (HPD) regions

A \(100\times(1-\alpha)\%\) HPD region consists of a subset of the parameter space, \(s(y)\subset\Omega\) such that

  • \(P(\theta\in s(y)|Y=y)=1-\alpha\);

  • If \(\theta_a\in s(y)\), \(\theta_b\notin s(y)\), then \(p(\theta_a|y)\geq p(\theta_b|y)\).

A HPD region is not necessarily an interval (e.g. for multi-modal distributions, the HPD region can consist of a union of distinct intervals).

If the HPD region is an interval, it is the narrowest interval with \(100\times(1-\alpha)\%\) coverage.

Example

Suppose the posterior distribution \(p(\theta|y)\) is Beta\((5,2)\).

95% quantile-based interval:

\(q_{0.025;\beta(5,2)}=0.3588\) and \(q_{0.975;\beta(5,2)}=0.9567\)

\[\Rightarrow\mbox{ 95% quantile-based interval }= [0.3588,0.9567]\]

The width of this interval is \(q_{0.975;\beta(5,2)}-q_{0.025;\beta(5,2)} = 0.5980\)

95% HPD interval:

There is no closed-form formula for this; it has to be found numerically, e.g. by sliding a horizontal line down the y-axis on a graph of the density and checking the coverage of the resulting region. In R, you can use the function hdi() from the HDInterval package (for example).

\[\Rightarrow\mbox{ 95% HPD interval }= [0.4094,0.9822]\]

The width of this interval is \(0.9822-0.4094 = 0.5728\).

Note that this is (slightly) narrower than the quantile-based interval.
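The HPD interval above can also be found in base R (a minimal sketch, without the HDInterval package) by scanning over the lower-tail probability and minimising the interval width:

```r
# Find the narrowest interval containing 95% posterior probability for a
# Beta(5, 2) posterior by scanning over the lower-tail probability p
p <- seq(0, 0.05, by = 1e-5)
width <- qbeta(p + 0.95, 5, 2) - qbeta(p, 5, 2)
pOpt <- p[which.min(width)]

c(lower = qbeta(pOpt, 5, 2), upper = qbeta(pOpt + 0.95, 5, 2))
# approx. [0.4094, 0.9822], matching the HPD interval quoted above
```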

Bayes factor

To compare 2 hypotheses, or 2 models, or 2 parameter values, we can compute the prior odds and compare these to the posterior odds:

\[ \frac{p(\theta_1|y)}{p(\theta_2|y)}=\frac{p(y|\theta_1)p(\theta_1)/p(y)}{p(y|\theta_2)p(\theta_2)/p(y)}=\frac{p(y|\theta_1)}{p(y|\theta_2)}\cdot\frac{p(\theta_1)}{p(\theta_2)} \]

where

  • \(\frac{p(\theta_1)}{p(\theta_2)}\) are the prior odds

  • \(\frac{p(\theta_1|y)}{p(\theta_2|y)}\) are the posterior odds

  • \(\frac{p(y|\theta_1)}{p(y|\theta_2)}\), the ratio of the marginal likelihoods, is called the Bayes factor

The Bayes factor tells us how to update our prior odds into posterior odds after observing the data.

The Bayes factor can be interpreted as how much the data favour one model or parameter value over another.

The Bayes factor can be used for model selection, but as it depends on the marginal likelihoods, which are highly sensitive to aspects of the model that are often arbitrary and cannot be tested, it is not recommended in practice as a model selection tool. It is better to compare models based on posterior predictive accuracy.

Note that there is a relationship between Bayes factors and frequentist p-values.
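A tiny sketch of a Bayes factor comparing two candidate parameter values under a Poisson sampling model (the data values are purely illustrative):

```r
# Bayes factor comparing lambda = 3 against lambda = 2 for Poisson data
y <- c(4, 2, 3, 5, 3)
BF <- prod(dpois(y, 3)) / prod(dpois(y, 2))
BF  # > 1: the data favour lambda = 3 over lambda = 2

# with prior odds of 1, the posterior odds equal the Bayes factor
```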

Markov Chain Monte Carlo (MCMC)

Monte Carlo approximation

In the examples we have seen so far, particularly when we used conjugate priors, we ended up with a posterior distribution for an unknown parameter \(\theta\) for which there existed simple formulae for posterior means and variances.

Often however we are interested in other aspects of the posterior distribution, e.g.

  • \(P(\theta\in A|y_1,\ldots,y_n)\) for arbitrary sets A.

  • posterior means and variances for functions of \(\theta\)

  • predictive distributions for missing or unobserved data

  • comparing two or more populations, so that we are interested in the posterior distribution of \(|\theta_1-\theta_2|\), \(\theta_1/\theta_2\) or \(\max\{\theta_1,\ldots,\theta_n\}\)

Obtaining exact values for these quantities can be difficult or impossible.

Trick:

  • Generate random samples for the parameters from their posterior distributions.

  • Use these samples to compute arbitrary quantities of interest to an arbitrary degree of precision.

Generating random samples \(\approx\) playing a game of chance. Since Monte Carlo is the most famous casino in the world, this approach was named the Monte Carlo approximation.

Recall the integral from Exercise 5, Practical 1&2:

\[\begin{align} P(\theta_1>\theta_2|...) &=& \int_0^\infty \gamma(\theta_2;68,45) \int_{\theta_2}^\infty\gamma(\theta_1;219,112)d\theta_1d\theta_2 \\ &=& \frac{112^{219}45^{68}}{\Gamma(219)\Gamma(68)}\int_0^\infty\int_0^{\theta_1}\theta_1^{218}\theta_2^{67}e^{-112\theta_1-45\theta_2}d\theta_2d \theta_1 \end{align}\]

This integral can be solved in several ways, but one way involves generating random samples from both gamma distributions and then approximating the integrals by summing over the random samples.

In Exercise 5, Practical 1&2, we solved the integral analytically (albeit with the help of a computer programme) and found \(P(\theta_1>\theta_2|...)=0.9726\).

Now let’s do it by sampling and approximating:

N<-1e5
theta1<-rgamma(n=N,shape=219,rate=112)
theta2<-rgamma(n=N,shape=68,rate=45)

sum(theta1>theta2)/N
[1] 0.97337

Let \(\theta\) be a parameter of interest and let \(y_1,\ldots,y_n\) be a sample from the sampling model distribution \(p(y_1,\ldots,y_n|\theta)\).

Suppose we can sample a number \(S\) of independent, random values for \(\theta\) from the posterior distribution \(p(\theta|y_1,\ldots,y_n)\):

\[\theta^{(1)},\ldots,\theta^{(S)}\sim_{\mbox{iid}}p(\theta|y_1,\ldots,y_n)\]

Then the empirical distribution of the samples \(\{\theta^{(1)},\ldots,\theta^{(S)}\}\) approximates \(p(\theta|y_1,\ldots,y_n)\), with the approximation improving as \(S\) increases.

The empirical distribution of \(\{\theta^{(1)},\ldots,\theta^{(S)}\}\) is called the Monte Carlo approximation to \(p(\theta|y_1,\ldots,y_n)\).

Note: Monte Carlo works for any distribution, not just posteriors.

The Law of Large Numbers (LLN) is why Monte Carlo works: by the LLN

\[\frac{1}{S}\sum_s g\left(\theta^{(s)}\right)\rightarrow E[g(\theta)|y_1,\ldots,y_n]\mbox{ as }S\rightarrow \infty\] \[\,\]

where \(E[g(\theta)|y_1,\ldots,y_n] = \int g(\theta)p(\theta|y_1,\ldots,y_n)d\theta\).

For example, this means that as \(S\rightarrow\infty\):

  • \(\bar{\theta}=\sum_s \theta^{(s)}/S\rightarrow E[\theta|y_1,\ldots,y_n]\)

  • \(\sum_s\left(\theta^{(s)}-\bar{\theta}\right)^2/(S-1)\rightarrow Var(\theta|y_1,\ldots,y_n)\)

  • \(\#\left(\theta^{(s)}\leq c\right)/S\rightarrow P(\theta\leq c|y_1,\ldots,y_n)\)

  • the \(\alpha^{th}\) percentile of \(\{\theta^{(1)},\ldots,\theta^{(S)}\}\rightarrow\theta_\alpha\)

  • \(\ldots\)

We can approximate just about any aspect of the distribution \(p(\theta|y_1,\ldots,y_n)\) in this way and with an arbitrary degree of precision given a large enough sample.

Example

Recall for a \(\Gamma(a,b)\) prior with a \(Pois(\lambda)\) sampling model for some data \(y_1,\ldots,y_n\), the posterior distribution is \(\Gamma(a+\sum_i y_i,b+n)\).

In Practical 3, Exercise 3, we had \(a=5,b=2, n=18, \sum_i y_i = 40\), yielding a \(\Gamma(45,20)\) posterior.

Exercise:

Use Monte Carlo, with sizes \(S=10, 100, 1000, 10000\) to approximate

  • the posterior mean \(E[\lambda|y_1,\ldots,y_n]\)

  • the posterior probability \(P(\lambda<2.1|y_1,\ldots,y_n)\)

  • the 95% quantile-based Bayesian confidence interval

Solution

Let’s first draw some Monte Carlo samples:

set.seed(1234)

a<-5; b<-2
sy<-40; n<-18

s10<-rgamma(n=10,shape=a+sy,rate=b+n)
s100<-rgamma(n=100,shape=a+sy,rate=b+n) 
s1000<-rgamma(n=1000,shape=a+sy,rate=b+n)
s10000<-rgamma(n=10000,shape=a+sy,rate=b+n)

The posterior mean is given by \(\frac{a+\sum_i y_i}{b+n}=45/20=2.25\).

Approximated by Monte Carlo:

print(mean(s10))
[1] 2.127597
print(mean(s100))
[1] 2.263114
print(mean(s1000))
[1] 2.267247
print(mean(s10000))
[1] 2.249742

The posterior probability \(P(\lambda<2.1|y_1,\ldots,y_n)\) is given by

pgamma(2.1,a+sy,b+n)
## [1] 0.3417987

We can approximate this by Monte Carlo:

sum(s10<2.1)/10
## [1] 0.6
sum(s100<2.1)/100
## [1] 0.36
sum(s1000<2.1)/1000
## [1] 0.32
sum(s10000<2.1)/10000
## [1] 0.3317

The quantile-based 95% Bayesian confidence interval is given by:

qgamma(c(0.025,0.975),a+sy,b+n)
## [1] 1.641165 2.953397

This too we can approximate using Monte Carlo, by taking empirical quantiles of the samples:

quantile(s10,probs=c(0.025,0.975))
##     2.5%    97.5% 
## 1.715649 2.527271
quantile(s100,probs=c(0.025,0.975))
##     2.5%    97.5% 
## 1.660528 3.134632
quantile(s1000,probs=c(0.025,0.975))
##     2.5%    97.5% 
## 1.678758 2.961095

We can plot how these quantities converge as \(S\rightarrow\infty\).

sVect<-1:1500
df<-data.frame(S=sVect,postMean=NA,postCdf=NA,postQ975=NA)
for(s in sVect){
  samp<-rgamma(n=s,shape=a+sy,rate=b+n)
  df$postMean[df$S==s]<-mean(samp)
  df$postCdf[df$S==s]<-sum(samp<2.1)/s
  df$postQ975[df$S==s]<-quantile(samp,probs=0.975)
}
titles<-paste("posterior",c("mean","cdf at 2.1","97.5% quantile"))
cols<-c("steelblue","greenyellow","salmon")
hVect<-c((a+sy)/(b+n),pgamma(2.1,a+sy,b+n),qgamma(0.975,a+sy,b+n))

par(mfrow=c(3,1),mar=c(5,7,2,1))
for(i in 1:3){
  plot(df$S,df[,1+i],type="l",xlab="number of Monte Carlo samples",ylab="Monte Carlo approx.",cex.lab=2.5,cex.axis=2.5,cex.main=2.5,lwd=3,col=cols[i],main=titles[i])
  abline(lwd=3,lty=2,h=hVect[i],col="darkgrey")
}

Posterior inference for arbitrary functions

Often we are interested in the posterior distribution of some function \(\gamma=g(\theta)\) of the parameter \(\theta\). For example, in the binomial model we are often interested in the log-odds: \(\gamma=\log\frac{\theta}{1-\theta}\).

Using Monte Carlo, we can approximate any aspect of the posterior distribution \(p(g(\theta)|y_1,\ldots,y_n)\):

\[ \mbox{independently} \begin{cases} \mbox{sample }\theta^{(1)}\sim p(\theta|y_1,\ldots,y_n)\mbox{, compute }\gamma^{(1)}=g\left(\theta^{(1)}\right) \\ \mbox{sample }\theta^{(2)}\sim p(\theta|y_1,\ldots,y_n)\mbox{, compute }\gamma^{(2)}=g\left(\theta^{(2)}\right) \\ \ldots \\ \mbox{sample }\theta^{(S)}\sim p(\theta|y_1,\ldots,y_n)\mbox{, compute }\gamma^{(S)}=g\left(\theta^{(S)}\right) \\ \end{cases} \]

Then as before, we can compute:

  • \(\bar{\gamma}=\sum_s \gamma^{(s)}/S\rightarrow E[\gamma|y_1,\ldots,y_n]\)

  • \(\sum_s\left(\gamma^{(s)}-\bar{\gamma}\right)^2/(S-1)\rightarrow Var(\gamma|y_1,\ldots,y_n)\)

  • \(\ldots\)
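For instance, if the posterior of \(\theta\) is the Beta(5,2) distribution used in the credible interval example above, the posterior of the log-odds can be approximated as follows:

```r
# Monte Carlo approximation to the posterior of the log-odds when
# p(theta | y) is Beta(5, 2)
set.seed(42)
theta <- rbeta(1e5, 5, 2)
gam <- log(theta / (1 - theta))

mean(gam)                       # approx. digamma(5) - digamma(2) = 1.083
quantile(gam, c(0.025, 0.975))  # 95% quantile-based interval for the log-odds
```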

Sampling from predictive distributions

New data \(\tilde{Y}\) are generated by the same sampling model as the observed data \(y_1,\ldots,y_n\):

\[\mbox{SAMPLING MODEL }\qquad\qquad\qquad\tilde{Y}\sim p(\tilde{y}|\theta)\]

We cannot predict from this model however, as \(\theta\) is random: we need to integrate it out:

\[\mbox{PREDICTIVE MODEL }\quad p(\tilde{y})=\int p(\tilde{y}|\theta)p(\theta)d\theta\]

The above is a prior predictive distribution as we have not conditioned on observed data \(y_1,\ldots,y_n\).

After we have observed data, we obtain the posterior predictive distribution:

\[\begin{align} p(\tilde{y}|y_1,\ldots,y_n) &=& \int p(\tilde{y}|\theta,y_1,\ldots,y_n)p(\theta|y_1,\ldots,y_n)d\theta \\ &=& \int p(\tilde{y}|\theta)p(\theta|y_1,\ldots,y_n)d\theta \end{align}\]

For a \(\Gamma(a,b)\) prior and a \(\mbox{Pois}(\lambda)\) sampling model, we saw (Exercise 4, Practical 1&2) that the posterior predictive distribution is \(\mbox{NegBin}\left(a+\sum_i y_i,(b+n)/(b+n+1)\right)\).

In many situations \(p(\tilde{y}|y_1,\ldots,y_n)\) is too complicated to sample from directly. However we often are able to sample from \(p(\theta|y_1,\ldots,y_n)\) and \(p(y|\theta)\).

We can then obtain samples from the posterior predictive distribution as follows:

\[ \mbox{independently} \begin{cases} \mbox{sample }\theta^{(1)}\sim p(\theta|y_1,\ldots,y_n)\mbox{, sample }\tilde{y}^{(1)}\sim p\left(\tilde{y}|\theta^{(1)}\right) \\ \mbox{sample }\theta^{(2)}\sim p(\theta|y_1,\ldots,y_n)\mbox{, sample }\tilde{y}^{(2)}\sim p\left(\tilde{y}|\theta^{(2)}\right) \\ \ldots \\ \mbox{sample }\theta^{(S)}\sim p(\theta|y_1,\ldots,y_n)\mbox{, sample }\tilde{y}^{(S)}\sim p\left(\tilde{y}|\theta^{(S)}\right) \end{cases} \]

\(\left\{\left(\theta^{(1)},\tilde{y}^{(1)}\right),\ldots,\left(\theta^{(S)},\tilde{y}^{(S)}\right)\right\}\) constitutes \(S\) independent samples from the joint posterior distribution of \((\theta,\tilde{Y})\).

\(\left\{\tilde{y}^{(1)},\ldots,\tilde{y}^{(S)}\right\}\) constitutes \(S\) independent samples from the marginal posterior distribution of \(\tilde{Y}\), i.e. the posterior predictive distribution.
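As a concrete sketch of this two-step scheme, consider the Gamma-Poisson model mentioned above; the prior values \(a=2\), \(b=1\) and the simulated data below are illustrative assumptions, not the values from the practicals:

```r
# Two-step sampling from the posterior predictive distribution for a
# Gamma(a, b) prior with a Pois(lambda) sampling model.
# The prior values a = 2, b = 1 and the data are illustrative assumptions.
set.seed(1)
a <- 2; b <- 1
y <- rpois(20, lambda = 5)   # stand-in for the observed data
n <- length(y)
S <- 1e5

# step 1: sample lambda^(s) from the posterior p(lambda | y1, ..., yn)
lambda_s <- rgamma(S, shape = a + sum(y), rate = b + n)
# step 2: sample ytilde^(s) from the sampling model p(ytilde | lambda^(s))
ytilde <- rpois(S, lambda = lambda_s)

# the sample mean approximates the mean (a + sum(y))/(b + n) of the
# closed-form NegBin(a + sum(y), (b + n)/(b + n + 1)) posterior predictive
mean(ytilde)
(a + sum(y)) / (b + n)
```

The Monte Carlo average of the \(\tilde{y}^{(s)}\) agrees with the mean of the closed-form negative binomial up to Monte Carlo error.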

An important use of sampling from the posterior predictive distribution is for assessing model fit:

  • Do samples from the posterior predictive distribution look like the actual observed data?

  • How likely are certain aspects of the observed data to be occurring under the posterior predictive distribution?

Markov Chain Monte Carlo

Bayesian inference: multi-parameter models

Bayesian inference for two or more unknown parameters is not conceptually different from the one parameter case.

E.g. for a normal sampling model with parameters \((\mu,\sigma^2)\) with joint prior distribution \(p(\mu,\sigma^2)\), posterior inference proceeds using Bayes’ rule:

\[p(\mu,\sigma^2|y_1,\ldots,y_n)=\frac{p(y_1,\ldots,y_n|\mu,\sigma^2)p(\mu,\sigma^2)}{p(y_1,\ldots,y_n)}\]

However, for many multiparameter models, the joint posterior distribution is non-standard and difficult to sample from directly.

The distributions \(p(\mu|\sigma^2,y_1,\ldots,y_n)\) and \(p(\sigma^2|\mu,y_1,\ldots,y_n)\) are called the full conditional distributions of \(\mu\) and \(\sigma^2\) as they are conditional distributions for a single parameter given everything else.

We have already seen (Exercise 8, Practical 1&2) that if

\[ \begin{cases} Y_1,\ldots,Y_n|\mu,\sigma^2\sim\mathcal{N}(\mu,\sigma^2) \\ \mu\sim\mathcal{N}(\mu_0,\sigma_0^2) \end{cases} \]

then

\[\mu|\sigma^2,y_1,\ldots,y_n\sim\mathcal{N}(\mu_n,\sigma_n^2)\]

where \(\mu_n=\frac{\mu_0/\sigma_0^2 + n\bar{y}/\sigma^2}{1/\sigma_0^2+n/\sigma^2}\) and \(\sigma_n^2=(1/\sigma_0^2+n/\sigma^2)^{-1}\).
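As a quick numeric illustration of these formulas (all values below are made up), \(\mu_n\) is a precision-weighted average of the prior mean and the sample mean:

```r
# Numeric illustration of the conditional posterior of mu.
# All values below are assumed for illustration: a vague prior around 0
# and data with sample mean 2.
mu0 <- 0; sigma0_sq <- 100          # prior mean and prior variance
sigma_sq <- 1; n <- 10; ybar <- 2   # sampling variance, sample size, sample mean

sigma_n_sq <- 1 / (1 / sigma0_sq + n / sigma_sq)
mu_n <- (mu0 / sigma0_sq + n * ybar / sigma_sq) * sigma_n_sq

mu_n        # close to ybar = 2: with a vague prior the data dominate
sigma_n_sq  # conditional posterior variance, much smaller than sigma0_sq
```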

If we write

\[p(\mu,\sigma^2)=p(\mu|\sigma^2)p(\sigma^2)\]

and if we assume

\[1/\sigma^2\sim\Gamma(\nu_0/2,\nu_0\tau_0^2/2)\]

then it can be shown that

\[1/\sigma^2|\mu,y_1,\ldots,y_n\sim\Gamma(\nu_n/2,\nu_n\tau_n^2(\mu)/2)\]

where

\[ \begin{cases} \nu_n &=& \nu_0+n \\ \tau_n^2(\mu) &=& [\nu_0\tau_0^2+ns_n^2(\mu)]/\nu_n \\ s_n^2(\mu) &=& \frac{1}{n}\sum_i(y_i-\mu)^2 \end{cases} \]

So to summarise, for a \(\mathcal{N}(\mu,\sigma^2)\) sampling model:

  • conditionally on \(\sigma^2\), the normal distribution is a conjugate prior for \(\mu\)

  • conditionally on \(\mu\), the inverse gamma distribution is a conjugate prior for \(\sigma^2\)

This is called a semi-conjugate or conditionally conjugate prior for \((\mu,\sigma^2)\).

Note this does not guarantee that the resulting joint distribution \(p(\mu,\sigma^2)\) is conjugate for \((\mu,\sigma^2)\).

Note that if

\[1/X\sim\Gamma(a,b)\]

then \(X\) follows an inverse gamma distribution:

\[ X\sim\mbox{inv-}\Gamma(a,b) \]

Under this semi-conjugate prior for \((\mu,\sigma^2)\), given a starting value \(\sigma^{2\,(0)}\), we can sample a value \(\mu^{(1)}\) from \(p(\mu|\sigma^{2(0)},y_1,\ldots,y_n)\).

\((\mu^{(1)},\sigma^{2 (0)})\) will be a sample from \(p(\mu,\sigma^2|y_1,\ldots,y_n)\).

But now, \(\mu^{(1)}\) can also be considered a sample from the marginal distribution \(p(\mu|y_1,\ldots,y_n)\) and we can then use this to sample a value \(\sigma^{2(1)}\) from \(p(\sigma^2|\mu^{(1)},y_1,\ldots,y_n)\).

Now \((\mu^{(1)},\sigma^{2(1)})\) can be considered a sample from \(p(\mu,\sigma^2|y_1,\ldots,y_n)\) and so \(\sigma^{2(1)}\) can be considered a sample from the marginal distribution \(p(\sigma^2|y_1,\ldots,y_n)\) and this can be used to generate a new value \(\mu^{(2)}\) and so on.

This is the principle of the Gibbs sampler.
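A minimal R sketch of this alternating scheme, using the full conditionals derived above with assumed hyperparameter values and simulated data:

```r
# Gibbs sampler for the semi-conjugate normal model described above.
# Hyperparameters and the simulated data are illustrative assumptions.
set.seed(42)
y <- rnorm(50, mean = 10, sd = 2)   # simulated data
n <- length(y); ybar <- mean(y)

mu0 <- 0; sigma0_sq <- 100          # prior on mu
nu0 <- 1; tau0_sq <- 1              # prior on 1/sigma^2

S <- 5000
mu <- numeric(S); sigma_sq <- numeric(S)
sig2 <- var(y)                      # starting value sigma^2(0)

for (s in 1:S) {
  # sample mu | sigma^2, y1, ..., yn
  sigma_n_sq <- 1 / (1 / sigma0_sq + n / sig2)
  mu_n <- (mu0 / sigma0_sq + n * ybar / sig2) * sigma_n_sq
  m <- rnorm(1, mu_n, sqrt(sigma_n_sq))

  # sample 1/sigma^2 | mu, y1, ..., yn
  nu_n <- nu0 + n
  tau_n_sq <- (nu0 * tau0_sq + sum((y - m)^2)) / nu_n
  sig2 <- 1 / rgamma(1, nu_n / 2, nu_n * tau_n_sq / 2)

  mu[s] <- m; sigma_sq[s] <- sig2
}

mean(mu)        # posterior mean of mu, close to the true value 10
mean(sigma_sq)  # posterior mean of sigma^2, close to the true value 4
```

With \(n=50\) observations and vague priors, the posterior means recover the values used to simulate the data.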

Gibbs sampler

Suppose you have a vector of parameters \(\mathbf{\phi}=(\phi_1,\ldots,\phi_p)\) and a joint distribution \(p(\mathbf{\phi})\). Given a starting value \(\mathbf{\phi}^{(0)}=(\phi_1^{(0)},\ldots,\phi_p^{(0)})\), the Gibbs sampler generates \(\mathbf{\phi}^{(s)}\) from \(\mathbf{\phi}^{(s-1)}\) as follows:

\[\begin{cases} \mbox{sample }\quad \phi_1^{(s)}\sim p(\phi_1|\phi_2^{(s-1)},\phi_3^{(s-1)},\ldots,\phi_p^{(s-1)}) \\ \mbox{sample }\quad \phi_2^{(s)}\sim p(\phi_2|\phi_1^{(s)},\phi_3^{(s-1)},\ldots,\phi_p^{(s-1)}) \\ \ldots \\ \mbox{sample }\quad \phi_p^{(s)}\sim p(\phi_p|\phi_1^{(s)},\phi_2^{(s)},\ldots,\phi_{p-1}^{(s)}) \end{cases}\]

This generates a dependent sequence of vectors

\[ \begin{align} \mathbf{\phi}^{(1)} &=& (\phi_1^{(1)},\ldots,\phi_p^{(1)}) \\ \mathbf{\phi}^{(2)} &=& (\phi_1^{(2)},\ldots,\phi_p^{(2)}) \\ \ldots & & \\ \mathbf{\phi}^{(S)} &=& (\phi_1^{(S)},\ldots,\phi_p^{(S)}) \\ \end{align} \]

where each \(\mathbf{\phi}^{(s)}\) depends on \(\mathbf{\phi}^{(0)}, \ldots, \mathbf{\phi}^{(s-1)}\) only through \(\mathbf{\phi}^{(s-1)}\).

This is called the Markov property and so the sequence is called a Markov chain.

Under conditions met for all models discussed in this course module, for any set \(A\):

\[P(\mathbf{\phi}^{(s)}\in A)\rightarrow\int_A p(\mathbf{\phi})d\mathbf{\phi}\;\mbox{ as }\; s\rightarrow\infty\]

In other words, the sample distribution of \(\mathbf{\phi}^{(s)}\) approaches the target distribution as \(s\rightarrow\infty\) regardless of the starting value \(\mathbf{\phi}^{(0)}\).

More importantly

\[\frac{1}{S}\sum_s g\left(\mathbf{\phi}^{(s)}\right)\rightarrow E[g(\mathbf{\phi})]=\int g(\mathbf{\phi})p(\mathbf{\phi})d\mathbf{\phi}\;\mbox{as }\;S\rightarrow\infty\]

This means we can approximate \(E[g(\mathbf{\phi})]\) with the sample average of \(\left\{g(\mathbf{\phi}^{(1)}),\ldots,g(\mathbf{\phi}^{(S)})\right\}\) just as we did in Monte Carlo approximation.

For this reason, we call such approximations Markov Chain Monte Carlo (MCMC).

Metropolis-Hastings algorithm

The Gibbs sampler is very powerful, but:

  • It assumes we can sample easily from the conditional distributions. This is usually the case for textbook examples, but can be tricky in many real-world examples.

  • It can struggle when parameters are strongly dependent – the chain then moves slowly through the parameter space, and for very complex dependency structures the full conditionals can be difficult to derive and implement.

The Gibbs sampler is a special case of a more general algorithm, the Metropolis-Hastings (MH) algorithm.

MH directly samples from the full joint distribution and uses an iteration of proposal and acceptance/rejection steps (the Gibbs sampler is a special case in the sense that we always accept a sampled value).

The Metropolis-Hastings algorithm requires:

  1. A target density \(p(\mathbf{\phi})\).

  2. A function \(f(\mathbf{\phi})\) that is proportional to \(p(\mathbf{\phi})\). This is key in making the algorithm feasible: we do not need to compute the normalising constant (which can be very hard for complex, high-dimensional densities).

  3. A proposal distribution \(g(\mathbf{\phi}^{'}|\mathbf{\phi}^{(k)})\) which we use to sample a new value for the \((k+1)^{\mbox{th}}\) iteration.

Given a target density, a proportional function and a proposal density, MH proceeds as follows:

  1. Start with an arbitrary starting value \(\mathbf{\phi}^{(0)}\).

  2. For each iteration \(k=1,2\dots,N\):

  1. Sample a new proposed value \(\mathbf{\phi}^{'}\) from \(g(\mathbf{\phi}^{'}|\mathbf{\phi}^{(k-1)})\).

  2. Calculate the acceptance ratio \(\alpha=\frac{f(\mathbf{\phi}^{'})\,g(\mathbf{\phi}^{(k-1)}|\mathbf{\phi}^{'})}{f(\mathbf{\phi}^{(k-1)})\,g(\mathbf{\phi}^{'}|\mathbf{\phi}^{(k-1)})}\). Since \(f\) is proportional to \(p\), the unknown normalising constant cancels; for a symmetric proposal the ratio simplifies to \(\alpha=p(\mathbf{\phi}^{'})/p(\mathbf{\phi}^{(k-1)})\).

  3. Sample a random value \(u\) between 0 and 1 from a uniform distribution.

    • If \(u\leq\alpha\), accept the proposed value by setting \(\mathbf{\phi}^{(k)}=\mathbf{\phi}^{'}\).

    • Else, reject the proposed value and set \(\mathbf{\phi}^{(k)}=\mathbf{\phi}^{(k-1)}\).
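A minimal sketch of these steps in R, using a symmetric Gaussian random-walk proposal (so the proposal density cancels in \(\alpha\)) and an unnormalised \(\Gamma(3,2)\) target; the target and tuning values are illustrative assumptions:

```r
# Random-walk Metropolis targeting a Gamma(3, 2) density.
# f is only proportional to the target: the normalising constant is never needed.
set.seed(7)
f <- function(phi) ifelse(phi > 0, phi^2 * exp(-2 * phi), 0)  # unnormalised Gamma(3, 2)

N <- 5e4
phi <- numeric(N)
phi[1] <- 1                                         # arbitrary starting value
for (k in 2:N) {
  phi_prop <- rnorm(1, mean = phi[k - 1], sd = 1)   # symmetric proposal
  alpha <- f(phi_prop) / f(phi[k - 1])              # acceptance ratio
  if (runif(1) <= alpha) phi[k] <- phi_prop         # accept proposed value
  else                   phi[k] <- phi[k - 1]       # reject: keep current value
}

mean(phi)   # approximates E[phi] = 3/2 for a Gamma(3, 2) target
```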

Each chain in the resulting MCMC is initialised from an arbitrary starting value \(\mathbf{\phi}^{(0)}\). Just like the Gibbs sampler, Metropolis-Hastings is guaranteed to converge eventually to the target distribution.

However, we only run chains of finite lengths, and given the dependent nature of the values in the chain, it is important to run several chains, to avoid getting stuck in some regions of the target probability density.

Ideally, after some initial iterations, called burn-in, all chains converge to the same area of the sampling space. If this is the case, we say that the chains are mixing well. If not, then this can indicate issues with the model we specified.

We have seen that Gibbs is a special case of the Metropolis-Hastings algorithm (sampling from the joint distribution using conditional distributions and always accepting proposed values).

But other algorithms are possible:

  • If we use a symmetric, Gaussian \(g(\mathbf{\phi}^{'}|\mathbf{\phi}^{(k)})=g(\mathbf{\phi}^{(k)}|\mathbf{\phi}^{'})\), i.e. a Gaussian random walk, then MH reduces to the simpler Metropolis algorithm.

  • If we use Hamiltonian dynamics evolution for the proposal density, we obtain Hamiltonian Monte Carlo. An extension of this is the No-U-Turn Sampler (NUTS).

These are just some of the possible samplers implemented in modern MCMC software - consult the documentation of your software to find out what is implemented. NIMBLE even allows you to code up your own sampler!

NIMBLE

NIMBLE is an R package to implement Bayesian models using MCMC for parameter estimation. NIMBLE stands for Numerical Inference for statistical Models for Bayesian and Likelihood Estimation.

NIMBLE was developed by a team based at the University of California, Berkeley and now has a maintainer team spanning several institutions. NIMBLE is based on the Bayesian inference Using Gibbs Sampling (BUGS) computation environment developed by the MRC Biostatistics Unit at the University of Cambridge and uses largely the same syntax.

NIMBLE provides, however, more functionality than BUGS, in particular providing access to more modern MCMC samplers, such as Hamiltonian Monte Carlo (often through additional R packages such as nimbleHMC). NIMBLE also aims to make it easy for users to extend the functionality of BUGS themselves, by adding their own samplers, distributions, …

NIMBLE provides the MCMC framework, but other packages are useful for processing the output from NIMBLE. Chief among these are the coda and MCMCvis packages.

NIMBLE is installed and loaded just like any other R package: install.packages("nimble") and library(nimble) - no separate downloads and installations required.

Alternatives to NIMBLE

Note that there are other packages that allow you to implement Bayesian models in R. All of these require you to install the MCMC sampler as a software library separate from R, then install R packages to interface with the sampler. They have the advantage of not being dependent on R, whereas NIMBLE, at least currently, is specific to R.

Given R’s near universal adoption in the statistical research world, this is however not much of a concern.

  • OpenBUGS, interfaced in R using the package R2OpenBUGS; this is an implementation of BUGS.

  • JAGS, interfaced in R using the package rjags; this is similar to BUGS but tries to figure out the best sampler (e.g. Gibbs, Metropolis-Hastings, slice sampling, …) and its parameters for you. JAGS is great, but is maintained by a single, now retired, researcher – updates are only very occasional and there is a risk that this MCMC package will not be supported in the future. This course used to be taught using JAGS.

  • Stan, interfaced in R using the package RStan; the main advantage of Stan used to be that it was the only way to run Hamiltonian Monte Carlo. With the advent of NIMBLE, this has now changed. Specifying models in Stan is somewhat less intuitive than in BUGS-based languages. Note that there are some packages, e.g. brms, that use Stan under the hood – you specify the model you want to fit in a more traditional fashion and the package writes and runs the Stan model for you.

NIMBLE resources

NIMBLE is well documented with the primary source for documentation and tutorials being the NIMBLE website:

https://r-nimble.org/

In addition to the documentation on that website, many training materials are hosted on GitHub: https://github.com/nimble-training.

Note on NIMBLE model code

Note that NIMBLE coding syntax is very similar to R syntax but with some differences, particularly for specifying distribution functions. For example, the normal distribution takes the mean and precision, not the standard deviation, as its default parameterisation (recall that if \(\tau\) is the precision and \(\sigma\) the standard deviation of a normal distribution, then \(\tau=1/\sigma^2\)).

However, unlike other BUGS-based languages, NIMBLE has made efforts to allow alternative, R-like parameterisations: for example dnorm(0,0.01) and dnorm(0,sd=10) are equivalent.

Fitting models with NIMBLE

Fitting a NIMBLE model in R requires 3 things:

  1. A dataset.

  2. The NIMBLE model code.

  3. An R script to pre-process the data, hold the NIMBLE model code, build and run the model using NIMBLE, and process the resulting samples from the posterior density.

Example

  • Fit the model from Practical 3, Exercise 2 using NIMBLE.

  • Inspect the trace plot and plot the posterior distribution.

  • Compute the posterior mean and the quantile-based 95% Bayesian confidence interval.

First we need to specify the model using NIMBLE / BUGS code. In R, load the nimble package and use the function nimbleCode() for this:

library(nimble)
nimble version 1.3.0 is loaded.
For more information on NIMBLE and a User Manual,
please visit https://R-nimble.org.

Note for advanced users who have written their own MCMC samplers:
  As of version 0.13.0, NIMBLE's protocol for handling posterior
  predictive nodes has changed in a way that could affect user-defined
  samplers in some situations. Please see Section 15.5.1 of the User Manual.

Attaching package: 'nimble'
The following object is masked from 'package:stats':

    simulate
The following object is masked from 'package:base':

    declare
modCode<-nimbleCode(
  {
    # sampling model
    for(i in 1:N){
      y[i]~dbern(pi)
    }
    
    # prior
    pi~dbeta(2,3)
  }
)

This specifies the model. Now we need to fit this model using MCMC.

Below is a one-line NIMBLE MCMC invocation. Note that niter - nburnin is the number of posterior samples per chain.

parsPosterior<-nimbleMCMC(
  code=modCode,
    # this specifies the model code
  constants=list(N=25),
    # this lists all constants
  data=list(y=c(rep(1,16),rep(0,25-16))),
    # this lists the data
  monitors=c("pi"),
    # this specifies the parameter(s) we want to sample during the MCMC
  nchains=4,
    # this specifies how many chains we want to run
  niter=1100,
    # this specifies the total number of iterations to run
  nburnin=100,
    # this specifies the number of burn-in iterations
  thin=1,
    # values larger than 1 discard samples after burn-in (thin=10 will only keep every 10th sample)
    # these days it is usually recommended NOT to thin
  WAIC=TRUE,
    # calculates the WAIC online as sampling progresses; this is useful to compare models
  samplesAsCodaMCMC=TRUE
    # specify this to be able to handle the output easily with other MCMC packages
)

Apart from the one-line invocation, you can also do a full NIMBLE model and MCMC workflow. This takes more coding, but provides some advantages, including:

  • Finer control over parameters.
  • Compiling code for faster runtime.

For the rest, the results will be identical to the one-line invocation.

# create model object
mod<-nimbleModel(
  code=modCode,
  constants=list(N=25),
  data=list(y=c(rep(1,16),rep(0,25-16))),
)

# set up the MCMC
modConf<-configureMCMC(
  model=mod,
  monitors=c("pi"),
  enableWAIC=TRUE
) # default random walk Metropolis sampler

# build the MCMC
modMCMC<-buildMCMC(
  conf=modConf
)

# compilation (optional - but next step will be faster if you do this first)
compMCMCCode<-compileNimble(mod)
compMCMCModel<-compileNimble(modMCMC,project=mod) 

# run the MCMC (1,100 iterations per chain with 100 burn-in iterations, i.e. 1,000 posterior samples per chain)
parsPosterior<-runMCMC(
  mcmc=compMCMCModel, # specify modMCMC to run the uncompiled model if you skipped the compilation step
  nchains=4,
  niter=1100,
  nburnin=100,
  WAIC=TRUE,
  samplesAsCodaMCMC=TRUE
)

MCMC diagnostics

The coda library has useful functions to check your MCMC output.

library(coda)

# In all the code below, if you did not specify 'WAIC=TRUE' in the NIMBLE model,
# then you can drop the '$samples' part; i.e. plot(parsPosterior) etc

# check trace plot, empirical posterior distribution, potential scale reduction factor
plot(parsPosterior$samples)

gelman.diag(parsPosterior$samples)
## Potential scale reduction factors:
## 
##    Point est. Upper C.I.
## pi          1          1

# posterior mean estimate
summary(parsPosterior$samples)$statistics["Mean"]
##      Mean 
## 0.5977244

# posterior quantile based 95% credible interval
summary(parsPosterior$samples)$quantiles[c("2.5%","97.5%")]
##      2.5%     97.5% 
## 0.4214989 0.7591437

You can get the same summaries in a single output by using the MCMCsummary() function from the MCMCvis package:

library(MCMCvis)

MCMCsummary(parsPosterior$samples)
##         mean         sd      2.5%       50%     97.5% Rhat n.eff
## pi 0.5977244 0.08724512 0.4214989 0.5999392 0.7591437    1  4000
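Because this beta-binomial model is conjugate, the MCMC output can be sanity-checked against the exact posterior, \(\mbox{Beta}(2+16,3+25-16)=\mbox{Beta}(18,12)\):

```r
# Exact posterior for the Beta(2, 3) prior with 16 successes in 25 trials
a_post <- 2 + 16
b_post <- 3 + 25 - 16

a_post / (a_post + b_post)               # exact posterior mean: 0.6
qbeta(c(0.025, 0.975), a_post, b_post)   # exact quantile-based 95% interval
```

Both the posterior mean and the quantile-based interval agree with the MCMC estimates above up to Monte Carlo error.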

Burn-in

MCMC algorithms usually take a number of iterations (how many depends on the model, the sampler and the starting values) to move into the region of high posterior probability.

We have seen that the Gibbs sampler (and Metropolis-Hastings more generally) is guaranteed to converge to the target distributions, but we said nothing about how long this would take.

For this reason, the initial iterations (the burn-in) are generally discarded as they are not samples from the target distribution.

How many to discard: depends on model, sampler, initial value…

A trace plot can help to check if a longer burn-in is required.

Mixing

It is important that the MCMC sampler explores the parameter space efficiently. We want to avoid iterations where the MCMC samples stay ‘flat’ in one region and also we want to avoid many steps in the same direction.

To avoid that a single chain gets stuck in one part of the parameter space, we should run multiple chains and check that they all explore similar regions of the parameter space.

Auto-correlations / thinning

We saw that MCMC chains are dependent sequences of samples – there will be autocorrelations between samples. Autocorrelations should drop off rapidly with increasing lag. If not, a chain can be thinned: e.g. keep only every 10th sample in the chain.

Generally, thinning only makes sense to reduce storage requirements: a long, unthinned chain averages out autocorrelations and yields higher-precision estimates than a thinned chain.

Autocorrelations can also be assessed by computing the effective sample size for each sampled parameter - this is the equivalent number of iterations if there were no autocorrelation. It indicates the information content of the MCMC process. Crucially, this is the sample size determining the convergence in the MCMC CLT.
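A common estimate of the effective sample size is \(S_{\mbox{eff}}=S/(1+2\sum_k\rho_k)\), where \(\rho_k\) is the lag-\(k\) autocorrelation. Below is a base R sketch on a simulated AR(1) series standing in for an autocorrelated MCMC chain (the AR coefficient 0.8 is an illustrative assumption):

```r
# Effective sample size from autocorrelations: S_eff = S / (1 + 2 * sum(rho_k)).
# An AR(1) series with coefficient 0.8 (an assumed value) stands in
# for an autocorrelated MCMC chain.
set.seed(99)
S <- 1e4
chain <- as.numeric(arima.sim(model = list(ar = 0.8), n = S))

rho <- acf(chain, lag.max = 100, plot = FALSE)$acf[-1]    # drop the lag-0 term
rho <- rho[seq_len(min(which(rho < 0.05), length(rho)))]  # truncate once negligible
ess <- S / (1 + 2 * sum(rho))

ess   # far smaller than S; AR(1) theory gives roughly S * (1 - 0.8) / (1 + 0.8)
```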

General diagnostics

To diagnose / check Bayesian models, fitted using MCMC, there are 2 main types of checks to make:

  1. Posterior predictive checks – this is a general framework that Bayesian statistics provide to allow testing model assumptions. WAIC, which we specified to be computed in the earlier example, is the Widely Applicable Information Criterion (also known as Watanabe-Akaike Information Criterion) and is a metric of out-of-sample predictive accuracy.

  2. MCMC diagnostics – this is to check that the MCMC algorithm seems to have converged (we can never be sure of this, but we can check for obvious non-convergence)

Here we focus (mostly) on the latter. There are many diagnostics that can be computed, but the main ones include:

  • Trace plots - should look like a hairy caterpillar if the posterior has converged to the stationary distribution.

  • Empirical posterior distributions of parameters in the model (and compare to prior) - should look sensible.

  • Potential scale reduction factor (Gelman-Rubin’s convergence diagnostic) - should be close to 1.

  • Effective sample size for the different sampled parameters; gives an indication of autocorrelations and indicates whether more iterations should be run to guarantee that the MCMC CLT holds.

  • WAIC or similar metrics - for comparing different models fitted to the same data.

Alternatives to NIMBLE and alternatives to MCMC

Alternatives to NIMBLE

  • JAGS

  • Stan

  • Bayes-X

  • BUGS

Alternatives to MCMC

  • Integrated nested Laplace approximation (INLA)

EXAMPLE: Bayesian linear regression model

Simulate the following data:

set.seed(1309)

N<-50

x1<-rexp(N,rate=2)
x2<-rnorm(N,mean=-2,sd=0.5)
x3<-rpois(N,lambda=20)
eps<-rnorm(N,mean=0,sd=0.75)

y<-5+x1+0.1*x3+eps

df<-data.frame(y=y,x1=x1,x2=x2,x3=x3)

Fit the following regression model using NIMBLE:

\[Y=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\epsilon\]

where \(\epsilon\sim\mathcal{N}(0,\sigma^2)\)

The NIMBLE model code:

modGLM<-nimbleCode(
  {
    for(i in 1:N){
      y[i]~dnorm(yhat[i],tau)
      yhat[i]<-b0+b1*x1[i]+b2*x2[i]+b3*x3[i]
    }
    b0~dnorm(0,0.0001)
    b1~dnorm(0,0.0001)
    b2~dnorm(0,0.0001)
    b3~dnorm(0,0.0001)
    tau<-pow(sigma,-2)
    sigma~dgamma(0.1,0.1)
  }
)
set.seed(123) # (optional) for reproducibility

parsPosterior<-nimbleMCMC(
  code=modGLM,
  constants=list(N=nrow(df)),
  data=list(y=df$y,x1=df$x1,x2=df$x2,x3=df$x3),
  monitors=c("b0","b1","b2","b3","sigma"),
  nchains=4,
  niter=11000,
  nburnin=1000,
  samplesAsCodaMCMC=TRUE
)
# check trace plot, empirical posterior distribution, potential scale reduction factor
par(mfrow=c(2,5))

traceplot(parsPosterior)
densplot(parsPosterior)

gelman.diag(parsPosterior)
Potential scale reduction factors:

      Point est. Upper C.I.
b0          1.01       1.03
b1          1.00       1.00
b2          1.01       1.03
b3          1.00       1.01
sigma       1.00       1.00

Multivariate psrf

1.01
effectiveSize(parsPosterior)
       b0        b1        b2        b3     sigma 
 559.5317 8599.6708  959.9527  901.1931 7513.6723 
# posterior mean estimate
summary(parsPosterior)$statistics[,"Mean"]
##        b0        b1        b2        b3     sigma 
## 5.4512950 0.9515548 0.1977067 0.1037935 0.7542105
# posterior quantile based 95% credible interval
summary(parsPosterior)$quantiles[,c("2.5%","97.5%")]
            2.5%     97.5%
b0     4.1083535 6.7691517
b1     0.5023994 1.4010345
b2    -0.2759913 0.6807954
b3     0.0567035 0.1494124
sigma  0.6166397 0.9308013

The MCMCvis package provides a way to combine the parameter summaries with parameter-specific diagnostics:

MCMCsummary(parsPosterior)
           mean         sd       2.5%       50%     97.5% Rhat n.eff
b0    5.4512950 0.67978300  4.1083535 5.4460185 6.7691517 1.01   560
b1    0.9515548 0.22849428  0.5023994 0.9510276 1.4010345 1.00  8600
b2    0.1977067 0.24195086 -0.2759913 0.1992354 0.6807954 1.01   960
b3    0.1037935 0.02350058  0.0567035 0.1037339 0.1494124 1.00   901
sigma 0.7542105 0.07993037  0.6166397 0.7470167 0.9308013 1.00  7514

Note that you can access all the individual MCMC samples: parsPosterior is a list with 4 elements, 1 for each chain. Each element is a matrix with a column for every monitored parameter (the monitors argument) and a number of rows corresponding to (niter - nburnin)/thin.

Use this to produce your own trace and histogram plots, compute various statistics of the parameters etc!

parsPosterior[[1]][1:10,]
##             b0        b1          b2         b3     sigma
##  [1,] 5.556194 0.7622882 0.006818571 0.08196353 0.7354032
##  [2,] 5.649104 0.8770670 0.056329434 0.07998191 0.8296416
##  [3,] 5.845487 0.5654574 0.014918379 0.08356947 0.8296416
##  [4,] 5.593312 1.0234984 0.182497236 0.09136098 0.6499593
##  [5,] 5.689520 1.1160244 0.201338173 0.08989110 0.6499593
##  [6,] 5.782594 0.8341678 0.139319639 0.08703490 0.6499593
##  [7,] 5.684924 1.0392181 0.086439855 0.08310488 0.8036415
##  [8,] 5.748479 0.5939684 0.074928230 0.08508148 0.7925079
##  [9,] 6.043985 0.5271855 0.228664027 0.08500948 0.7925079
## [10,] 6.116821 0.8140915 0.414385238 0.09293210 0.7925079

Posterior predictive checks

More fundamental than the MCMC diagnostics is the question of whether the model itself is suitable for the data.

In general there can be 2 main reasons why a Bayesian model may not fit well:

  1. Misspecified prior: typically a prior that is too strong and conflicts with the observed data.

  2. Misspecified sampling model: some of the model assumptions are not met by the data.

The Bayesian paradigm, unlike the frequentist paradigm, provides a very general framework for checking model assumptions: posterior predictive checks.

The idea is simple, but powerful:

  • Simulate data from the posterior predictive distribution.

  • Compare this to the observed data.

    • Define a statistic that allows checking a model assumption.
    • Compute this for the observed data and then for a large number of simulated datasets.
    • Compute the proportion of times that the statistic computed from the simulated data is at least as extreme as the value from the observed data – if this proportion is very low, it can indicate a violated assumption.

Example 1

Suppose we assume

  • A Bernoulli sampling model \(Y_i\sim\mbox{Bern}(\pi)\).
  • A beta prior \(\pi\sim\mbox{Beta}(200,50)\).

And let’s suppose we observe the following data

  • \(0,1,0,0,0,0,0,0,1,1,0,0,0,0,0\) (i.e. n=15, k=3)

We have seen that the posterior distribution will be \(\mbox{Beta}(200+3,50+15-3)=\mbox{Beta}(203,62)\)

The posterior predictive distribution is Bernoulli with parameter \(\pi_{post}\):

\[ \begin{align} \pi_{post} &=p(\tilde{Y}=1|n=15,k=3)\\ &=\int p(\tilde{Y}=1|\pi) p(\pi|n=15,k=3)d\pi\\ &=\int\pi\frac{\Gamma(265)}{\Gamma(203)\Gamma(62)}\pi^{202}(1-\pi)^{61}d\pi\\ &=\frac{\Gamma(265)}{\Gamma(203)\Gamma(62)}\frac{\Gamma(204)\Gamma(62)}{\Gamma(266)}=\frac{203}{265}\\ &=0.7660 \end{align} \]

Writing \(Y=\sum_{i=1}^{15}Y_i\) for \(Y_i\sim_{iid}Bern(\pi_{post})\), then \(Y\sim Bin(n=15,\pi_{post})\). Using this we can calculate \(P(Y\leq3|\pi_{post},n=15)=5.93\cdot10^{-6}\).
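The tail probability quoted above can be checked directly in R:

```r
# Posterior predictive success probability and binomial tail probability
pi_post <- 203 / 265
pi_post                                # approximately 0.766

pbinom(3, size = 15, prob = pi_post)   # P(Y <= 3), of the order of 6e-06
```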

So, having observed 3 successes among 15 trials, our fitted model nevertheless implies it is extremely unlikely to observe 3 or fewer successes among 15 trials.

This clearly indicates a problem with our posterior model and it is due to a very strong prior favouring a high success rate and very little observed data with low success rate. In this case, the prior is 1) at loggerheads with the data and 2) too strong.

Example 2

Suppose we want to estimate the probability of in-hospital death for A & E patients. Suppose we conduct a study where we recruit \(n=30\) consecutive patients attending A & E and we record \(Y_i=0\) if patient \(i\) is alive at discharge and \(Y_i=1\) if they died.

We can develop a Bayesian model by assuming:

  • A \(\mbox{Beta}(1/2,1/2)\) prior.

  • A \(\mbox{Bernoulli}(\pi)\) sampling model, i.e. \(Y_i|\pi\sim_\mbox{iid}\mbox{Bern}(\pi)\).

Now suppose we observe the following sequence of data where half-way through our study, there was a black-out at the hospital and a lot of life-saving equipment was not working:

\[0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1\]

In other words: all but 1 of the first 15 patients survived but all except 2 of the last 15 died.

This seems to be at odds with our assumption of independent data.

How can we test this independence assumption?

The data are consistent with \(\pi=0.5\), but under the assumption of independence, we would expect a random pattern of 0’s and 1’s - in other words, we would expect regular switches between 0 and 1 in the sequence of outcomes.

Let us define

\[T=\sum_{i=1}^{30-1} I(Y_i\ne Y_{i+1})\]

For the observed data: \(T=5\) (5 switches).

We know that

\[ \begin{cases} \pi &\sim Beta(a,b) \\ y_1,\ldots,y_n|\pi &\sim Bern(\pi) \end{cases} \] \[\Rightarrow Y_{n+1}|y_1,\ldots,y_n\sim Bern\left(\pi_{post}=\frac{a+\sum_i y_i}{a+b+n}\right)\]

So here we have \(a=1/2, b=1/2, n=30, \sum_i y_i = 15\) and hence \(\pi_{post}=\frac{15.5}{31}=0.5\).

So what we can do now is repeatedly simulate data from the posterior predictive distribution (a sequence of 30 observations at a time, i.e. \(\tilde{Y}_i\sim_{iid}\mbox{Bern}(\pi_{post})\), \(i=1,\ldots,30\)), compute \(T\) each time, and so estimate how likely it would be to observe \(T\leq5\) under the posterior predictive distribution.

N<-1e5
t<-integer(N)
for(i in 1:N){
  y_ppc<-rbinom(n=30,size=1,prob=0.5)
  t[i]<-sum(y_ppc[1:(length(y_ppc)-1)]!=y_ppc[2:length(y_ppc)])
}
sum(t<=5)/N
[1] 0.00039

So we find that the empirical probability of observing 5 or fewer switches is \(3.9\times10^{-4}\). In other words, according to our model, it is very unlikely that we would have observed the data we actually observed!

This is evidence that our model is not correct and in this case it is due to a wrong assumption of independence.